Agent Engineering · 25 min

When measured pressure means something specific

Stage 5 is the chapter where the meter is the protagonist. Most teams will never write this post for their own product. The honesty is in knowing which side of that sentence you are on.

Imagine a different apartment than the one in the first post. Same birch desk. Same lamp. The window faces the same street, but the season has rolled fully forward, and what is outside the window is the kind of late winter that has stopped pretending to be winter and has not yet committed to spring. The streetlights are on because the sun does not come up at this hour. The thermostat is set to sixty-four because the heating bill has been a line item the founder reads now, alongside the dashboards.

It is 3am US Eastern time. The pager went off thirty-seven minutes ago. The founder is sitting at the desk in a hoodie with the sleeves pulled over the hands, the way you sit when the apartment is quiet in the way only 3am apartments are quiet. The agent has been running for almost two years. The events table from the Friday night in the first post has more than seven million rows. The hash chain from the Stage 4 winter has walked clean in every CI run since the constraint was committed. The agent has paying customers. One of them is in Singapore, and the contract the founder signed five weeks ago carries a 200ms p99 SLA on every tool call.

On the second monitor there are three tabs. The first is a CloudWatch latency histogram, segmented by customer. The Singapore customer's p99 has been sitting at 380ms for the last week. The second is the contract itself, scrolled to the SLA clause, which the founder has read four times this morning the way you read a clause you wish said something else. The third is the database topology diagram. A single primary in us-east-1. A trans-Pacific round trip from Singapore to us-east-1 of roughly 180 milliseconds at minimum, on a quiet wire, to the nearest edge. The arithmetic on the page is not in the founder's favor.

The first instinct, the one I want to name early so we can come back to it, is to spin up a read replica in ap-southeast-1. The instinct is directionally honest. The honest answer is more complicated, and the rest of the post is about the more complicated answer and the meter that earned the right to ask the question in the first place.

In the first post in this series I named four axes and a five-stage operational ladder. Stage 1 was the small set. Stage 2 was tenant isolation turning on. Stage 3 was the consequence layer that paying customers earned. Stage 4 was the chapter where what your system can prove mattered more than what it does. Stage 5 is the chapter where the deferrals from Stage 1's "almost never" column either light up because a meter says they must, or stay where they were, permanently. There are more cells in that column that stay where they were than the founder reading this post wants to believe.

This is the last post in the series. It is also the post most agent teams should not write for their own product. My read is that the difference between the teams who legitimately write this post and the teams who write it because they want to be the kind of team that writes it is one specific thing, and the specific thing is the meter.

[Figure: The Pacific between the primary and the customer, against the SLA the architecture cannot meet. A schematic of the network path from the single writeable primary in us-east-1 to the Singapore customer in ap-southeast-1 (200ms p99 SLA, signed), with a trans-Pacific RTT of 180ms minimum. Below it, a histogram of the customer's p99 latency over the last 30 days, with the last fourteen days above the 200ms SLA line and today at 380ms.]

The meter is the protagonist

Before any Stage 5 decision lands, the meter has to be defensible. Every architectural extension this post will describe has a named meter in a named tool, and the meter has to have been showing the pressure for a stretch of time long enough that one bad week cannot explain it. My read is that 30 days is the right floor. One bad weekend is a debugging problem. Four bad weeks in a row are a pattern. The 30 day criterion is a forcing function against reacting to a spike, and I want to name it as my judgment rather than a sourced industry standard, because it is.
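A minimal sketch of the 30 day floor as a check, assuming the team can export one value per day for the metric in question. The function names and the sample series are hypothetical; only the 200ms threshold and the 30 day floor come from the post.

```python
from datetime import date, timedelta

def sustained_breach_days(daily_values, threshold):
    """Count the most recent consecutive days on which the value exceeded the threshold.

    daily_values: list of (date, value), sorted oldest to newest, one entry per day.
    """
    streak = 0
    for _, value in reversed(daily_values):
        if value <= threshold:
            break
        streak += 1
    return streak

def pressure_is_sustained(daily_values, threshold, min_days=30):
    """True only when the breach has held for at least min_days in a row."""
    return sustained_breach_days(daily_values, threshold) >= min_days

# Hypothetical series: 16 quiet days followed by 14 days over a 200ms threshold.
start = date(2025, 1, 1)
series = [(start + timedelta(days=i), 150.0 if i < 16 else 380.0) for i in range(30)]

print(sustained_breach_days(series, 200.0))   # 14
print(pressure_is_sustained(series, 200.0))   # False
```

Counting only the unbroken streak is a deliberately strict reading of "a stretch of time long enough that one bad week cannot explain it"; a team could reasonably relax it to "N of the last 30 days" instead.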

The meter for read pressure on the primary is pg_stat_statements, the Postgres extension that records execution statistics for the queries running against the database. When the top of the list by total time is dominated by read queries, consistently, the read replica conversation has earned its place. When the top is dominated by write queries, a read replica does nothing useful. The same dashboard answers two different questions, and the answer depends on which column you sort by.
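One way to turn the pg_stat_statements list into a single number is to split total time into reads and writes. The sketch below assumes rows have already been exported as (query, total time) pairs; the export query uses the `total_exec_time` column name from Postgres 13 and later (older versions call it `total_time`). The classifier is deliberately crude, and the sample rows and the `records` table are hypothetical.

```python
# Query one might run to export the rows; not executed by this sketch.
EXPORT_SQL = """
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 50;
"""

WRITE_KEYWORDS = ("insert", "update", "delete", "merge")

def is_write(query):
    """Crude classifier based on the leading keyword of the normalized query."""
    q = query.lstrip().lower()
    if q.startswith("with"):
        # A CTE can wrap a write, so fall back to a keyword search.
        return any(k in q for k in WRITE_KEYWORDS)
    return q.startswith(WRITE_KEYWORDS)

def read_share(rows):
    """rows: iterable of (query, total_exec_time_ms). Returns the read fraction of total time."""
    rows = list(rows)
    total = sum(t for _, t in rows)
    read = sum(t for q, t in rows if not is_write(q))
    return read / total if total else 0.0

# Hypothetical export.
rows = [
    ("SELECT * FROM events WHERE tenant_id = $1", 9000.0),
    ("INSERT INTO events (tenant_id, payload) VALUES ($1, $2)", 700.0),
    ("UPDATE records SET status = $1 WHERE id = $2", 300.0),
]
print(round(read_share(rows), 2))   # 0.9
```

A share near 0.9 held for weeks is the "dominated by read queries" case; a share near 0.1 is the case where a replica does nothing useful.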

The meter for region pressure is the latency histogram. On RDS the database side of that view lives in CloudWatch's RDS metrics, and the per-customer or per-region segmentation comes from the application's own instrumentation. The metric names that matter at Stage 5 are ReadLatency and WriteLatency at the database, and the application's own p99 histogram at the tool call boundary. The two together tell you whether the latency budget is being spent in the database or on the wire. A 380ms p99 for a customer whose database queries return in 12ms is a wire problem, and the architectural answer is a different shape than the one a database problem asks for.
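The 380ms and 12ms figures can be put side by side as a rough attribution. This is a sketch, not a rigorous decomposition: percentiles do not subtract cleanly, and the 50% cutoff between "database problem" and "wire problem" is an assumption chosen for illustration.

```python
def split_latency(app_p99_ms, db_time_ms):
    """Attribute the tool-call p99 to the database or to everything outside it."""
    return {
        "db_ms": db_time_ms,
        "outside_db_ms": app_p99_ms - db_time_ms,   # wire, application, queues
        "db_share": db_time_ms / app_p99_ms,
    }

def diagnose(app_p99_ms, db_time_ms, db_share_cutoff=0.5):
    """Label the breach by where most of the budget is being spent."""
    share = split_latency(app_p99_ms, db_time_ms)["db_share"]
    return "database problem" if share >= db_share_cutoff else "wire problem"

print(split_latency(380, 12)["outside_db_ms"])   # 368
print(diagnose(380, 12))                         # wire problem
```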

The meter for sharding pressure is the database's own write throughput, segmented by customer or by table. When a single customer crosses 40% of total writes consistently, the sharding conversation earns its place. When the largest customer is 8% of writes, it does not.

The meter for semantic caching pressure is the embedding cost line item in the monthly infrastructure bill, joined to a cache hit rate analysis run against the actual query distribution. If the embedding cost is visible enough to be named in the monthly review, and the hit rate analysis crosses some measured threshold, the conversation can start. The threshold is workload dependent; I would use 30% as a rough illustrative floor, while being honest that the right number is the number your workload actually produces.

The meter for failover topology is unusual in that it can be a contractual term rather than a graph. If a customer contract names a recovery point objective and a recovery time objective, the contract is the meter. If a region failure has already happened, the failure is the meter. Failover topology is the one Stage 5 decision where "we should be ready in case" is not a sufficient trigger, because being ready is a quarter of work.

The frame I would write on the wall at this stage is the same frame every Stage 5 decision rests on. If you cannot name the specific metric, in the specific tool, and the metric has not been showing the pressure for 30 days or more, you are not at Stage 5. You are at Stage 4 with a Hacker News tab open in another window, and the work you are about to commit to is theater. Theater is expensive at Stage 5. The architecture is not reversible cheaply, the operational load is permanent, and the next funding conversation includes a question about infrastructure overhead that you will not have a good answer to.

[Figure: The meter-first decision tree for any Stage 5 architecture proposal. Five gates sit between the proposal and the migration. Gate 1: does a dashboard show the pressure? If no, defer and build the meter first. Gate 2: has it shown the pressure for at least 30 days? If no, wait, watch the metric, and come back in 30 days. Gate 3: does a customer contract name a term you cannot meet? If yes, architect now, because the deadline is contractual; if no, architect on the meter, when it shows you have crossed your headroom and not before. Gate 4: is there a reversal plan? If no, treat the decision as permanent and consider twice. Gate 5: is there a named owner for the new operational load? If no, hire first. If yes, proceed. The right answer for most teams this quarter is to wait at gate 1 or gate 2.]
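The five gates can be sketched as a small function. The field names are hypothetical, and treating a missing reversal plan as a caveat rather than a stop is one reading of the figure's "if no, treat as permanent".

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    dashboard_shows_pressure: bool   # gate 1
    days_of_pressure: int            # gate 2
    contract_term_unmet: bool        # gate 3
    has_reversal_plan: bool          # gate 4
    has_named_owner: bool            # gate 5

def evaluate(p, min_days=30):
    """Walk the gates in order and return the first outcome that applies."""
    if not p.dashboard_shows_pressure:
        return "DEFER: build the meter first"
    if p.days_of_pressure < min_days:
        return "WAIT: watch the metric, come back in 30 days"
    timing = "architect now" if p.contract_term_unmet else "architect on the meter"
    caveat = "" if p.has_reversal_plan else " (no reversal plan: treat as permanent)"
    if not p.has_named_owner:
        return "HIRE FIRST: answer ownership before architecture"
    return f"PROCEED: {timing}{caveat}"

print(evaluate(Proposal(True, 14, True, True, True)))
# WAIT: watch the metric, come back in 30 days
print(evaluate(Proposal(True, 35, True, False, True)))
# PROCEED: architect now (no reversal plan: treat as permanent)
```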

The Singapore scenario, in slow motion

Back to the apartment. The p99 for Singapore is 380ms. The p99 for every other customer is between 60ms and 110ms, and has been fine for the eighteen months the agent has had customers outside the founder's apartment. The Singapore customer is the first customer on the wrong side of an ocean.

The first instinct is the read replica in ap-southeast-1. The instinct reads the geography correctly. The instinct misreads the workload, and I want to walk why because the directional answer is so close to right that the commitment can happen before the write path is checked.

The tool calls missing the SLA are not read tool calls. They are write tool calls that read before they write and write before they respond. The agent receives the Singapore request, looks up context, computes a side effect (an updated record, an event written to the audit chain), writes the side effect, and returns to the caller. The write goes to the primary in us-east-1. The trans-Pacific round trip for the write is in the critical path. A read replica in ap-southeast-1 does nothing for the write path.

The only architectural change that helps the write-before-respond pattern from Singapore is moving the write path closer to Singapore. That is not a replica. It is a primary, or a regional architecture with two primaries and a replication strategy that keeps them coherent, or a re-architecture of the agent so the response does not depend on a synchronous write to the far-away primary.
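The arithmetic behind that conclusion fits in a few lines. The sketch counts how many trans-Pacific round trips sit on the synchronous write path under each topology and subtracts them from the SLA. The 200ms and 180ms figures are from the post; the one-round-trip-per-remote-write model is a simplifying assumption, and 180ms is a quiet-wire minimum, so the real p99 headroom is smaller than the number printed.

```python
SLA_MS = 200
RTT_FLOOR_MS = 180   # quiet-wire minimum; a p99 round trip will be higher

def remaining_budget_ms(remote_round_trips, sla_ms=SLA_MS, rtt_ms=RTT_FLOOR_MS):
    """Budget left for lookups, compute, and commit after paying for the wire."""
    return sla_ms - remote_round_trips * rtt_ms

# Remote round trips on the synchronous write path under each topology.
write_path_round_trips = {
    "single primary in us-east-1": 1,
    "plus read replica in ap-southeast-1": 1,   # the write still crosses the ocean
    "regional primary in ap-southeast-1": 0,
}

for name, trips in write_path_round_trips.items():
    print(f"{name}: {remaining_budget_ms(trips)}ms left")
# single primary in us-east-1: 20ms left
# plus read replica in ap-southeast-1: 20ms left
# regional primary in ap-southeast-1: 200ms left
```

The replica row is the point: it changes where reads land, not the number of ocean crossings the write has to make.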

The shape of the work is not "add a read replica." The shape of the work is "decide what the global topology is going to be for the next two years, write the migration plan, and accept that this is a quarter of work rather than a sprint." That is the realization at the desk at 3:47am.

Read replicas, only when the meter calls for them

I want to start with the simplest of the Stage 5 architectures, the same-region read replica, because the failure mode is the most legible and the success criterion is the most measurable. A read replica is a secondary database instance that subscribes to the primary's write-ahead log and serves read queries against the replicated data. Read traffic the application routes to the replica frees the primary's CPU and I/O for writes and for the transactional work that has to stay on the primary.

The meter that earns the read replica is pg_stat_statements showing read queries dominating total time, consistently, for at least 30 days. Or a CloudWatch metric showing the primary's CPU pegged above 80% during the agent's read-heavy windows. Or a latency histogram showing read query time growing relative to write query time. The 30 day criterion applies regardless.

The honest cost has two parts. The infrastructure cost is roughly that of a smaller database instance, which for a mid-tier Postgres on managed infrastructure lands somewhere in the $200 to $800 per month range. The number is illustrative; the actual bill depends on instance class, region, storage, and the read traffic the replica is asked to serve. The operational cost is the replica lag monitoring work. Someone has to watch the replication lag, because an unmonitored replica drifts, and the first time a stale read causes a customer-visible issue is also the first time the team discovers the replica was behind.

The application has to be instrumented to route reads to the replica intelligently, and this is the part teams underestimate. A read replica with no read routing is a database instance the application never queries. Adding the routing is application work: a wrapper around the data access layer that decides, per query, whether the query is safe to serve from the replica (no read-after-write dependency, no consistency requirement that would be violated by replica lag) or has to go to the primary. Done well, the routing turns the replica from a recurring line item into a load-shedding asset. Done poorly, it serves stale reads to customers and produces bugs that depend on the exact timing of replica lag.
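A minimal sketch of such a routing wrapper, under stated assumptions: the class name, the lag limit, and the read-after-write window are all hypothetical, and `replica_lag_s` is expected to come from whatever lag monitor the team already runs. The clock is injectable so the behavior can be exercised without sleeping.

```python
import time

class ReadRouter:
    """Decide, per query, whether the replica is safe to serve it."""

    def __init__(self, max_lag_s=1.0, read_after_write_window_s=5.0, clock=time.monotonic):
        self.max_lag_s = max_lag_s
        self.window_s = read_after_write_window_s
        self.clock = clock
        self.last_write_at = {}   # tenant_id -> time of that tenant's most recent write

    def record_write(self, tenant_id):
        self.last_write_at[tenant_id] = self.clock()

    def choose(self, tenant_id, replica_lag_s, needs_fresh=False):
        if needs_fresh:
            return "primary"   # caller declared a consistency requirement
        if replica_lag_s is None or replica_lag_s > self.max_lag_s:
            return "primary"   # lag unknown or too high: do not risk a stale read
        last = self.last_write_at.get(tenant_id)
        if last is not None and self.clock() - last < self.window_s:
            return "primary"   # read-after-write: the replica may not have it yet
        return "replica"

# Usage with a fake clock.
now = [100.0]
router = ReadRouter(clock=lambda: now[0])
router.record_write("tenant-a")
print(router.choose("tenant-a", replica_lag_s=0.2))   # primary
now[0] += 10
print(router.choose("tenant-a", replica_lag_s=0.2))   # replica
print(router.choose("tenant-a", replica_lag_s=3.0))   # primary
```

Defaulting to the primary whenever anything is uncertain keeps the failure mode as "the replica is underused" rather than "a customer saw stale data".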

Sibling scene one: the vanity replica

Picture a different team than the founder at the desk in the apartment. The founder of this other team went to a conference six weeks ago. Someone on a panel talked about their multi-region setup with what came across as honest pride. The founder came home with the conference energy and added a read replica in ap-southeast-1 on a Tuesday. The team has no customers in Asia Pacific. Their largest customer is in Amsterdam. Their p99 latency for every customer is 95ms, well inside the SLA they wrote into the contract last year. There is no pg_stat_statements evidence of read pressure on the primary. There is no CloudWatch evidence of CPU contention. There is, however, a slide deck the founder took notes on, which is what the replica is anchored to.

The replica costs roughly $800 per month. The application code has no smart read routing; the engineering team meant to add it but the next sprint filled up with feature work, and the primary still handles all queries. The replica sits at 4% utilization six months later. The 4% is mostly replication overhead and health checks.

A board observer asks about the infrastructure line item at the next quarterly review. The founder explains the replica. The board observer asks what the read pressure problem was that motivated it. The founder says they were being proactive. The board observer nods in the way that means something, and the next funding conversation includes a question about "infrastructure overhead" that the founder did not have a good answer to in the moment.

The lesson I would name explicitly: Stage 5 decisions are reversible only at the cost of someone's confidence in the team's discipline. The replica can be turned off. The conversation is harder to reverse.

Multi region replication, when the meter or the contract calls for it

The conversation in the apartment at 3am is not the read replica conversation. It is the multi region conversation, and the work is heavier.

The meter is a CloudWatch latency histogram showing consistent SLA breach for a specific customer or region for 30 days. Or a customer contract with geographic data residency terms (the customer's data must be stored in a specific jurisdiction, typically the EU or a country with data sovereignty requirements). Or a region failure that took the agent offline for a measurable window and a customer is now asking for the failover plan.

The Singapore scenario is the first kind. The contract names the SLA. The histogram names the breach. The customer's morning has been the team's afternoon for five weeks. The work is to design and ship a regional architecture where the Singapore traffic, including the SLA-binding writes, lands in a primary close enough to Singapore that the round trip fits inside the latency budget.

What that architecture looks like depends on the database. The cleanest shape is a primary in ap-southeast-1 for the Singapore customer's writes, with replication to us-east-1, and a routing layer that directs each customer's traffic to their home primary. Reads for Singapore land locally. Reporting and cross customer queries run against a designated region's primary or a global read view that reconciles the two.
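The routing layer in that shape is, at its core, a lookup from customer to home primary. A sketch, with hypothetical customer identifiers and connection strings:

```python
# Hypothetical mapping of customers to their home region; everyone else stays put.
HOME_REGION = {
    "singapore-customer": "ap-southeast-1",
}
DEFAULT_REGION = "us-east-1"

# Hypothetical connection strings, one primary per region.
PRIMARY_DSN = {
    "us-east-1": "postgresql://primary.us-east-1.internal/agent",
    "ap-southeast-1": "postgresql://primary.ap-southeast-1.internal/agent",
}

def primary_for(customer_id):
    """Return (region, dsn) for the primary that owns this customer's writes."""
    region = HOME_REGION.get(customer_id, DEFAULT_REGION)
    return region, PRIMARY_DSN[region]

print(primary_for("singapore-customer")[0])   # ap-southeast-1
print(primary_for("amsterdam-customer")[0])   # us-east-1
```

Keeping the mapping explicit and small makes the invariant easy to audit: each customer's writes land on exactly one primary.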

A managed multi-region database (Aurora Global Database, Cloud Spanner, CockroachDB) presents a different shape, where the database itself handles the cross-region replication and the application code routes traffic to the nearest region. These three products have different replication models and different degrees of Postgres compatibility: Aurora is a Postgres-compatible engine, while Spanner and CockroachDB are distributed SQL databases that offer a PostgreSQL interface or wire compatibility rather than Postgres itself. Aurora Global Database keeps writes at a single primary region with cross-region read replicas. Spanner and CockroachDB offer distributed writes with different consistency and cost tradeoffs. The evaluation is worth doing, and the right choice depends on the replication model, the cost, and the lock-in tolerance. My read is that for a team crossing into Stage 5 because of a single customer contract, the managed product is worth seriously evaluating, because the operational load of running your own cross-region replication is substantial. The tradeoff is the lock-in cost and the cost of the managed product itself, which can be material at the data volumes a multi-region Postgres setup introduces. I want to be honest that I have not run a multi-region migration on a live customer-facing system and that the managed product evaluation is a question I would put on the table early.

The honest cost of multi region replication, whether the team builds it or buys it, is a quarter of focused engineering work. Not a sprint. Not a week. A quarter. The data model has to be audited for global consistency requirements: the row level security policies from Stage 2 now apply across regions, which means the policy expressions and the test coverage have to be validated in the multi region context. The backup strategy has to be extended to cover both regions. The monitoring has to cover both regions. The failover runbook has to be written and tested and drilled. Someone has to be on call for the replica region. If "the founder" is the answer to who is on call for ap-southeast-1, the answer determines what the team looks like for the next year.

The Singapore scenario in the apartment ends with a notebook and a quarter long plan. The plan starts with the meter, names the topology, names the audit work, and names the owner. The owner question is the one the founder spends the most time on, because the honest answer is that the founder cannot be the owner of a 24 hour on call rotation. The owner question is what the next hire conversation is going to be about.

Sharding, only against the actual hotspot

Sharding is the Stage 5 decision with the longest downstream tail, and I want to be honest about that before describing the meter that earns it.

The shard key is one of the few architecture decisions that calcifies. After sharding, changing the shard key requires a full data migration: every row moves to a new partition under a new key, foreign key references update, application code updates, and the system runs in a degraded or paused state for the duration. At Stage 5 data volumes, a shard key migration is weeks of engineering with a freeze window, or months with a live migration toolchain. Get the shard key wrong and you pay for it until you rebuild. The book that treats this tradeoff cleanly is Martin Kleppmann's Designing Data-Intensive Applications, which I would put on the desk before any sharding conversation starts.

The meter is the database's own write throughput, segmented by customer and by table. The number that matters is what fraction of total writes is attributable to the top one or two customers, or to a single table dominating writes across all customers. When a single customer crosses 40% of total writes consistently, or a single table dominates write throughput in a way vertical scaling has stopped keeping up with, the sharding conversation has earned its place.

The application question that has to be answered before the shard key is committed is which hotspot the metrics actually show. The intuitive answer is customer_id. The intuitive answer is usually wrong. The real hotspot is usually a cross-customer table where the write pressure comes from aggregate activity rather than from one dominant customer. Events tables, audit logs, message queues, notification tables: these grow fastest because every customer writes to them on every interaction. The customer_id shard key does nothing for a table whose writes are evenly distributed across customers.
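Checking which hotspot the metrics show means computing the same share along two dimensions. The sketch below assumes a sampled write log with a customer and a table on each row; the log format, the customer names, and the `records` table are hypothetical.

```python
from collections import Counter

def top_share(write_log, key):
    """Return (value, share of total writes) for the largest contributor along `key`."""
    counts = Counter(row[key] for row in write_log)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    value, count = counts.most_common(1)[0]
    return value, count / total

# Hypothetical sample: ten customers writing evenly, most writes landing in events.
write_log = []
for c in range(10):
    write_log += [{"customer_id": f"customer-{c}", "table": "events"}] * 7
    write_log += [{"customer_id": f"customer-{c}", "table": "records"}] * 3

print(top_share(write_log, "customer_id"))   # ('customer-0', 0.1)
print(top_share(write_log, "table"))         # ('events', 0.7)
```

In this sample no customer is anywhere near 40% of writes, while one table takes 70% of them, which is the case where a customer_id shard key does not address the hotspot.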

Sibling scene two: the premature shard

Picture another team. The engineering team is reading a Stripe blog post about sharding, which is a good blog post about a different workload than theirs. The team's largest customer is 8% of total writes. None of the metrics are showing write pressure that vertical scaling cannot handle. The decision to shard is taken anyway, because the engineering lead has concluded that any product with their customer profile should shard by customer.

The shard key ships in a quarter end sprint. It is a clean migration. The tests pass.

Two years later, the actual write hotspot is not the customer dimension. It is the events table from the Stage 4 audit hardening, which is receiving writes from every tenant's tool calls and is now the largest table in the database by write volume. The events table receives roughly 3,000 writes per second during peak windows, and the writes are not dominated by any single customer; they are the aggregate of all customers' agent activity. The customer_id shard key does nothing for the events table hotspot. The team needs to shard the events table separately, against a different key (likely time-based or actor-based), and the schema is already committed to a customer_id sharding topology that conflicts with the events table's natural partitioning shape.

A full data migration is scheduled. It takes nine days. The team writes a postmortem. The honest sentence in the postmortem is that the shard key was committed against an intuition rather than against the metric.

If you must shard, shard against the hotspot the write throughput metrics actually show. The customer-ID intuition is usually wrong because the real hotspot is usually a cross-customer table where the write pressure is the aggregate behavior of all customers, not the dominance of one. The events table is the most common offender at Stage 5, because the Stage 1 decision to write every agent action to a single events table was the right decision for Stages 1 through 4, and the table grew during all of them.

Semantic caching, only when the bill says so

Semantic caching is the Stage 5 decision most likely to be added because someone read a blog post, and I want to be candid about that.

The pattern is straightforward. A semantic cache intercepts vector retrieval calls, computes a similarity check against a cache of recent retrievals, and returns the cached result if the similarity is above a configured threshold. The lookup avoids re-embedding and re-querying the vector store for queries that are semantically near-identical to a recent one.

The meter is the embedding cost line item in the monthly infrastructure bill, joined to a cache hit rate analysis against the agent's actual query distribution. The bill is the easy part: if the embedding cost is visible enough to be named in the monthly infrastructure review, the cost is material. The hit rate analysis is the harder part, and I would do it before building the cache, not after.

The way to measure hit rate without building the cache is to log a sample of queries over a representative window, compute pairwise semantic similarity against a rolling window using the same embedding model the cache would use, and count what fraction of queries fall above the similarity threshold you would set. My read is that 30% is a reasonable illustrative floor, but the right number is the number your workload produces. A workload of repetitive exploratory queries has a different hit rate than a workload of highly varied analytical questions.
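That measurement can be sketched in plain Python. The sketch assumes the logged queries have already been embedded with the model the cache would use; the 0.95 similarity threshold, the window size, and the toy two-dimensional vectors are illustrative assumptions.

```python
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def estimated_hit_rate(embeddings, threshold=0.95, window=1000):
    """Fraction of queries with a near-duplicate among the previous `window` queries."""
    recent = deque(maxlen=window)
    hits = 0
    for emb in embeddings:
        if any(cosine(emb, prev) >= threshold for prev in recent):
            hits += 1   # a cache holding the recent window would have answered this
        recent.append(emb)
    return hits / len(embeddings) if embeddings else 0.0

# Toy embeddings in query order: two near-repeats of the first query, two distinct ones.
sample = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (1.0, 0.01), (0.7, 0.7)]
print(estimated_hit_rate(sample))   # 0.4
```

The scan is quadratic in the window size, which is fine for a notebook over a sample and is not meant as the lookup structure of a production cache.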

The honest cost of a semantic cache is more than the infrastructure. The cache requires an embedding-based similarity check on every incoming query. The cache invalidation problem is different from a standard key-value cache: a cache entry is keyed on a semantic region of the embedding space, so deciding when a cached result is stale requires a coherence policy that handles the case where the underlying documents have been updated since the entry was written. If the document set changes and the cache layer does not know, the cache serves stale retrievals without the query ever reaching the vector store to fail loudly. The failure mode is silent, which is the harder failure mode to debug.
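One possible coherence policy, sketched under loud assumptions: every cache entry records a version number for the document set, and any document update bumps that number. The class, the single global `corpus_version`, and the linear scan are all hypothetical simplifications; a real system would likely version at a finer grain.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Similarity-keyed cache whose entries expire when the document set changes."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []   # list of (embedding, result, corpus_version)

    def put(self, embedding, result, corpus_version):
        self.entries.append((embedding, result, corpus_version))

    def get(self, embedding, corpus_version):
        # Drop entries written against an older document set before matching.
        self.entries = [e for e in self.entries if e[2] == corpus_version]
        for cached_emb, result, _ in self.entries:
            if _cosine(embedding, cached_emb) >= self.threshold:
                return result
        return None   # miss: the caller falls through to the vector store

cache = SemanticCache()
cache.put((1.0, 0.0), "docs-a", corpus_version=1)
print(cache.get((0.99, 0.05), corpus_version=1))   # docs-a
print(cache.get((0.99, 0.05), corpus_version=2))   # None
```

The second lookup misses because the documents changed, which turns the silent stale-retrieval case into an ordinary cache miss.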

The theater version is the team that adds the cache at Stage 3 because they read a blog post claiming it would "save tokens." The hit rate against their actual workload turns out to be 4%, because their users ask highly varied questions rather than the repetitive exploratory queries the cache pattern is calibrated for. The cache infrastructure costs more per month than the token savings it generates. The cache is turned off six months later, after a silent retrieval bug serves a stale answer to a customer.

Measure the hit rate before building the cache. The measurement is logged queries and pairwise similarity in a notebook, not infrastructure. If the measurement does not justify the cache, the cache does not earn its place yet.

Failover topology, only when the contract or the failure calls for it

Failover topology is the Stage 5 decision where the work is mostly operational rather than architectural, and the gap between "we have a replica" and "we can fail over" is the part teams underestimate.

The meter is one of two things. Either a region failure has already happened and the customer is asking for the failover plan, or a customer contract names a recovery point objective and a recovery time objective that cannot be met by the current single-region architecture. The first is a failure that earned the conversation. The second is a contract that scheduled it. Neither is "we should be ready in case."

The architecture work is the multi region replication from the section above, plus a designated failover region with the replicated data, plus a DNS or traffic management layer that can cut over from primary to failover on demand. The architecture is the easy part. The operational work is the part that turns the architecture into a failover topology rather than a document.

A failover topology that has never been tested is not a failover topology. It is a runbook in a Google Doc that no one has read in eight months. The test is the cost: a planned failover drill in a maintenance window, where the team actually promotes the replica to primary, confirms that the application reconnects correctly, confirms that the DNS cutover propagates within the documented RTO, and confirms that the last replication sync captured the data within the documented RPO. The drill reveals the gaps.
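What a drill produces is a set of timestamps that can be compared to the documented targets. A sketch, assuming the team records when the simulated failure began, the commit time of the last transaction that reached the replica, and when traffic was confirmed restored; the function name, timestamps, and targets are hypothetical.

```python
from datetime import datetime, timedelta

def drill_report(failure_at, last_replicated_commit_at, traffic_restored_at, rto, rpo):
    """Compare one drill's observed recovery time and data-loss window to the targets."""
    observed_rto = traffic_restored_at - failure_at
    observed_rpo = failure_at - last_replicated_commit_at
    return {
        "observed_rto": observed_rto,
        "observed_rpo": observed_rpo,
        "rto_met": observed_rto <= rto,
        "rpo_met": observed_rpo <= rpo,
    }

t0 = datetime(2025, 3, 1, 2, 0, 0)   # hypothetical drill start
report = drill_report(
    failure_at=t0,
    last_replicated_commit_at=t0 - timedelta(seconds=4),
    traffic_restored_at=t0 + timedelta(minutes=22),
    rto=timedelta(minutes=15),
    rpo=timedelta(seconds=30),
)
print(report["rto_met"], report["rpo_met"])   # False True
```

A result like this one, where the data made it across but the cutover took too long, is the kind of gap a drill exists to surface.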

I would budget six months between "we have a replica" and "we can fail over confidently." The six months is what it takes to write the runbook, drill it, find the first round of gaps, close them, drill again, find the second round, close them, and arrive at a state where the team trusts the runbook enough to invoke it under pressure. The team that has not drilled is not a team that can fail over. They are a team that has a replica and a hope.

The theater version is the team that writes the runbook and never drills it. A region failure occurs on a Friday afternoon. The team discovers at 4pm Eastern that the runbook describes a manual DNS cutover procedure that depends on a Route 53 configuration changed four months ago. The customer in the affected region is down for six hours while the team figures out the actual current procedure. The postmortem is honest. The honest sentence is that the architecture was in place and the runbook was not exercised.

The audit-specific read replica, the Stage 4 carryover that finally lights up

There is one read replica conversation that is narrower and more defensible than the rest, and it is the Stage 4 carryover whose pressure finally reaches Stage 5.

The events table from Stage 4 is now receiving compliance reads. The auditor's annual review queries the table. The customer's data export requests scan it. The regulatory reporting runs walk large portions of it. The compliance reads are long and scan-heavy. The production writes the agent generates are short and frequent. The two workloads are competing for the same I/O on the primary, and the write latency spikes during the compliance read windows in a way the production traffic cannot afford.

The meter is Postgres wait event data (the wait_event_type and wait_event columns in pg_stat_activity) showing write commits on the events table waiting or delayed while the compliance scans run, or CloudWatch showing the primary's write latency spiking during compliance read windows. The Stage 5 resolution is an audit-specific read replica that serves only the compliance reads. The scan-heavy queries are routed to the replica. The production primary's write capacity is reserved for the agent's writes. The two workloads stop fighting.
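The correlation with compliance windows can be checked from exported latency samples before any replica is provisioned. A sketch, assuming write latency samples tagged with a minute-of-day and a known schedule of compliance read windows; the sample values and the window are hypothetical.

```python
from statistics import mean

def latency_by_window(samples, windows):
    """Mean write latency inside and outside the compliance read windows.

    samples: list of (minute_of_day, write_latency_ms)
    windows: list of (start_minute, end_minute), end exclusive
    """
    def in_window(minute):
        return any(start <= minute < end for start, end in windows)

    inside = [ms for m, ms in samples if in_window(m)]
    outside = [ms for m, ms in samples if not in_window(m)]
    return (mean(inside) if inside else None,
            mean(outside) if outside else None)

# Hypothetical samples with one compliance scan running from minute 60 to 120.
samples = [(0, 4.0), (30, 5.0), (60, 18.0), (90, 22.0), (120, 6.0)]
inside, outside = latency_by_window(samples, [(60, 120)])
print(inside, outside, inside / outside)   # 20.0 5.0 4.0
```

A ratio that stays well above 1 across many windows is the dashboard-pointable evidence the replica is anchored to; a ratio near 1 says the scans are not the cause.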

This is the narrowest and most defensible Stage 5 addition. The replica has a named purpose (compliance reads), a named pressure (write blocking observed in wait events), a named meter (the wait event data), and a named customer (the auditor or the compliance function). If the team is already running a read replica from the earlier multi region work, the audit-specific replica may be the same instance configured with a routing rule rather than a third instance.

I would name this as the easiest Stage 5 decision to defend in a budget conversation, because every part of the justification can be pointed at on a dashboard. The cost is low, the operational tax is small, the reversal plan is trivial (turn off the routing rule, the replica returns to general read traffic), and the meter is unambiguous.

What does not light up at Stage 5

I want to spend the same surface area on the deferrals as I spent on the additions, because the honesty thesis of this post depends on it. The "almost never" column from the Stage 1 deferral table has been waiting for Stage 5 to either light up its cells or confirm that they stay where they were. Most of the cells stay where they were. A few have triggers that can be named explicitly. None of them light up by default, and the temptation to light them up because the team is now "at scale" is exactly the temptation this post exists to argue against.

Custom database forks. Trigger: never. The teams that maintain custom storage engines are infrastructure companies, not agent products. The trigger would be a workload impossible to express efficiently in any available database's query model, at a volume where the inefficiency is worth the engineering cost of maintaining the fork forever. That workload does not exist for an agent product. If it appears to exist, the first question is whether the workload has been expressed correctly in the existing database, not whether the database should be replaced.

Multi cloud. Trigger: almost never, and usually a contractual ask rather than an engineering decision. Multi cloud is a vendor lock-in hedge presented as an architecture decision. The operational tax (two monitoring stacks, two security models, two network topologies, two engineering knowledge bases) is real and permanent. The benefit is mostly theoretical. The legitimate trigger I can imagine, and have not personally encountered, is a customer contract that explicitly requires workload redundancy across providers.

Service mesh. Trigger: almost never. A service mesh (Istio, Linkerd, Consul Connect) adds mTLS, traffic management, and observability between microservices. The trigger is a microservices architecture at a complexity where service to service traffic is too varied to manage at the application layer. An agent product running a productive monolith should not add a service mesh. A service mesh without the microservices is infrastructure for a topology that does not exist.

Microservices decomposition of a working monolith. Trigger: almost never, and most rewrites at this stage are failure modes dressed up as progress. The legitimate triggers are a deployment bottleneck, a scaling bottleneck on one component that cannot be relieved without extracting that component, or a deliberate team organization decision (Conway's Law applied with intent). Not "we read that microservices are the right architecture." A working monolith at Stage 5 is an asset.

Re-architecture of the LLM pipeline. Trigger: almost never. The LLM provider's API is the architecture for most agent products this series describes. A team that builds a custom orchestration layer above the provider's API has either a very specific multi-model routing requirement the available frameworks cannot serve, or a preference for building infrastructure the product does not require. The former is a legitimate trigger. The latter is the failure mode I want to name explicitly.

Custom storage engine for vectors. Trigger: almost never. A dedicated vector store earns its keep when the existing pgvector setup is the named bottleneck on a measured retrieval flow. A custom storage engine built on top of a dedicated store is in the same category as a custom database fork. The trigger is workload impossibility, not inconvenience.

"We may need this in three years" architecture. Trigger: explicitly never. Architecture built against a hypothetical future workload that has not been measured is architecture built against fiction. Stage 5 is the chapter where "we may need this in three years" is replaced by "the meter says we need this now," and the replacement is the whole discipline.

Figure: The "if ever" column, populated at the end of the series. Twelve deferrals from Stages 1 through 4. Six stay never. Six have triggers that can be named.

| Category | Note | Disposition | Trigger |
| --- | --- | --- | --- |
| Custom database forks | infrastructure-company work, not agent-product work | NEVER | no plausible trigger |
| Custom storage engines | same category as forks; workload impossibility, not inconvenience | NEVER (ALMOST) | workload impossibility only |
| Multi-cloud | vendor lock-in hedge presented as architecture; operational tax is permanent | NEVER (ALMOST) | contract-driven if at all |
| Service mesh | infrastructure for a topology that does not exist on a working monolith | NEVER (ALMOST) | microservices first, mesh second |
| Microservices decomposition of a working monolith | most rewrites at this stage are failure modes dressed up as progress | NEVER (ALMOST) | deployment or scaling bottleneck |
| Re-architecture of LLM pipeline | provider's API is the architecture for most agent products this series describes | NEVER (ALMOST) | multi-model routing requirement |
| THE METER EARNS THESE | | | |
| Vector store migration | stage 3+, when pgvector is the named bottleneck on a measured retrieval flow | TRIGGER NAMED | measured retrieval bottleneck |
| Read replicas | stage 5, pg_stat_statements showing read dominance for 30+ days | TRIGGER NAMED | pg_stat_statements top-by-time |
| Multi-region | stage 5, contracted SLA or 30 days of histogram breach | TRIGGER NAMED | contract or histogram |
| Sharding | stage 5, single customer above 40% of writes or one table dominating throughput | TRIGGER NAMED | 40% threshold on the meter |
| Semantic cache | stage 5, embedding bill named in monthly review and measured hit rate above threshold | TRIGGER NAMED | measured hit rate, not blog post |
| Failover topology | stage 5, RPO/RTO in the contract or a region failure that already happened | TRIGGER NAMED | contract or incident |
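The sharding row's trigger, a single customer above 40% of writes, is simple enough to compute from a sample of write events. A minimal sketch, assuming each write can be attributed to a `tenant_id`; the function name `dominant_writer` and the shape of the input are hypothetical, and the sample window is left to whoever owns the meter.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def dominant_writer(
    tenant_ids: Iterable[str], threshold: float = 0.40
) -> Optional[Tuple[str, float]]:
    """Return (tenant_id, share) if one tenant exceeds the write-share threshold.

    `tenant_ids` holds one entry per write event in the sampled window.
    """
    counts = Counter(tenant_ids)
    total = sum(counts.values())
    if total == 0:
        return None  # no writes sampled, so no meter reading
    tenant, n = counts.most_common(1)[0]
    share = n / total
    return (tenant, share) if share > threshold else None
```

A `None` result on a month of samples is the meter saying the sharding conversation has not started yet.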

The work, in honest scale

Stage 1 was a Friday night. Six decisions, mostly additive, a few hours of typing and a screenshot to a cofounder.

Stage 2 was a Thursday. One column add, one policy turn-on, three test runs. Half a day if Stage 1 was done well.

Stage 3 was a week. Idempotency enforcement, outbox pattern, RBAC turn-on, planned maintenance window.

Stage 4 was a month. Hash chain migration, append only constraint, actor_kind enum tightening, secret dereference layer, runbook, drill, security review.

Stage 5 is a quarter at minimum, per extension. The work is heavier than Stages 2 through 4 combined. Multi region migration alone is a quarter, with a freeze window or a live migration toolchain. Sharding is a quarter, with a postmortem at the end if the shard key was committed against intuition. Failover topology is six months from architecture to drilled, because the architecture is not the topology until the drill has happened.

Stage 5 is also the only chapter where I would seriously consider hiring a contractor specifically for the migration, and I want to name that as my read rather than a rule. The reasoning is that Stage 5 work is specialized (cross region replication is a domain expertise), the failure modes are severe (a misconfigured replication setup can cause data loss or silent stale reads), and the operational load is permanent. The founding team's opportunity cost of spending a quarter on Stage 5 infrastructure is a quarter not spent on the product. If the trigger is a single customer's contract terms, a contractor who has run this migration before may be the right trade. The contractor is the migration. The team is the steady state.

Stage 5 is downstream of a sales conversation, not an engineering conversation. The teams that legitimately cross into Stage 5 almost always do so because a customer signed a contract with terms the current architecture cannot meet. A latency SLA in a region the agent cannot serve from a single primary. A geographic data residency requirement that prohibits storing the customer's data in us-east-1. A write volume from a single customer that the database cannot accommodate without sharding. The contract is the trigger, not the roadmap. Stage 5 is the chapter where the founder stops being the protagonist of the system's growth and starts being the steward of someone else's contract.

Figure: The full ladder, with Stage 5 lit on measured pressure. Two more cells lit at Stage 5; eight earlier cells carry over. The architecture extensions earn their place. The "if ever" column finally carries answers, mostly no.

Columns: 02 Co-builders (first NDA'd operator); 03 Paying (first contract signed); 04 Regulated (audit + secrets live); 05 Measured pressure (meter or contract earned the work); 05+ If ever (documented trigger). Legend: dormant, lights up, required, never, trigger.

Rows: Row level security policies; Roles table, RBAC; Idempotency key enforcement; Outbox pattern; actor_kind discipline, real values; Agent secret separation; Hash chained audit, append only; GDPR tombstones; Materialized views, replicas, cache (required at Stage 5); Multi region, failover, sharding (required at Stage 5); Event sourcing (never); Custom forks, multi-cloud, mesh (never).

Stage 5 is downstream of a sales conversation, not an engineering conversation. Two cells lit on measured pressure. The rest stay deferred.

The pre flight, smaller and weirder than before

Every previous post in this series ended with a pre flight checklist. The Stage 5 pre flight is smaller and stranger, because it is mostly about meters rather than migrations. Six items. Each one is a question that has a binary answer.

1. The meter exists. Is there a dashboard, in a specific tool, that shows the pressure you are about to architect against? The specific metric, in the specific tool, visible to the specific engineer who will own the architecture decision. If the answer is "we know we'll need it eventually" rather than "here is the graph and here is the timestamp on the last reading," stop. The architecture decision is not ready. Build the meter first. Watch the meter. Come back when the meter has data.

2. The meter has been showing the pressure for at least 30 days. One bad week is a debugging problem, not an architectural trigger. The 30 day floor is my read, and I would defend it as the right discipline. If the dashboard was first checked this week, come back in 30 days. If the pressure has been consistent for 30 days, the conversation about the architecture can start.

3. The contract terms reference the SLA, residency, or commitment you cannot meet. If yes, the architecture decision has a deadline and a contractual consequence for missing it. If no, there is headroom on the schedule: the system is under pressure but the customer has not named the pressure as a contractual obligation. The work is still worth doing. The urgency is different. A 380ms p99 latency is a problem worth addressing; the question is whether it is a this-quarter problem or a next-quarter problem, and the contract is what determines which.

4. A reversal plan exists. If the read replica is wrong, can you turn it off in a week? If the answer is yes, the architecture is reversible and the cost of being wrong is bounded. If the answer is no, the decision is permanent and the cost of being wrong is a data migration. Sharding cannot meet this criterion. Multi region replication can be reversed but the cost is high. A same region read replica can be turned off in a day. Semantic caching can be turned off in a deployment. Naming the reversal plan before committing is the difference between a staged experiment and a permanent architectural commitment.

5. A cost ceiling exists. Each Stage 5 decision has a recurring cost. The replica is $200 to $800 per month. The multi region setup is several thousand per month in additional infrastructure. The semantic cache layer is the embedding cost plus the compute for the similarity check. Each cost is bounded if you bound it before committing. If the architecture commits the team to an open-ended recurring bill, the architecture is not ready to commit to. The cost ceiling does not have to be tight. It has to exist.

6. An ownership plan exists. Each Stage 5 decision has a permanent operational cost in addition to the infrastructure cost. Who is on call for the replica? Who reviews the shard rebalancing metrics? Who runs the annual failover drill? Who monitors the semantic cache hit rate after deployment? If the answer to any of these is "the founder," the answer determines what the team looks like for the next two years. Stage 5 is the first chapter where the operational load of the architecture is a meaningful fraction of the total engineering load. Name the owner before committing to the architecture. If the owner does not yet exist on the team, the architecture decision is also a hiring decision, and the hiring decision is what the architecture is actually waiting on.

The six questions are the discipline. The discipline is what separates Stage 5 done because the meter called for it from Stage 5 done because the conference said it would. Most teams reading this post should fail the pre flight, this quarter, on at least two of the six items. The right answer for most teams is to wait, build the meter, and come back when the pressure has earned the architecture.
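Because each question has a binary answer, the pre flight can be written down as a record that is hard to argue with later. A minimal sketch; the class and field names are hypothetical shorthand for the six items above, not an artifact from the series.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class PreFlight:
    meter_exists: bool           # 1. a named metric in a named tool
    pressure_30_days: bool       # 2. the pressure has held for 30 days
    contract_names_it: bool      # 3. the contract references the commitment
    reversal_plan_exists: bool   # 4. the way back has been named
    cost_ceiling_exists: bool    # 5. the recurring bill is bounded
    owner_named: bool            # 6. someone owns the operational load

    def failing(self) -> List[str]:
        """Names of the items answered 'no', in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self) -> bool:
        """True only when every item is answered 'yes'."""
        return not self.failing()
```

Item 3 is the one "no" that the text treats as changing the urgency rather than stopping the work, so a team might reasonably read `failing()` item by item rather than rely on `ready()` alone.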

The end of the small set

If you read all five posts and built the small set well, you bought yourself the ability to make Stage 5 decisions when the meter calls for them, not when ambition does. The bill is real either way. The only thing that changes is whether you paid for the architecture you needed or the architecture you wanted.

The series started on a Friday night in late spring, in an apartment that was small enough to hear the laptop. Six decisions, written into a migration file, with a lamp on and a cold coffee. The decisions were chosen to make every subsequent stage additive rather than reconstructive.

Stage 2 was the conversation with the second design partner, when the tenant_id column earned its keep. Stage 3 was the first paying customer, when the request_id field on every tool call became the enforcement layer that prevented a duplicate refund. Stage 4 was the regulated customer, when the events table carried four million rows of organic write history that the hash chain could walk. Each stage was additive. The foundation held.

Stage 5 is where the wisdom of the small set becomes observable in retrospect. If Stage 1 was done well, Stage 5 is a conversation about which specific architectural extension the meter calls for. If Stage 1 was not done well, Stage 5 is a reconstruction project wearing an architecture project's clothes. The Friday night in May determined which one.

If you are weighing a Stage 5 decision and the meter is unclear, I would welcome a conversation. The contact form at the top of the site goes to my inbox.

The founder at the desk closes the pager alert. The Singapore customer's morning is still four hours away. The founder makes a fifth cup of coffee from the new bag, which is a different bean than the one that started the year. The kitchen window faces east. The light is starting to come up around 5:30am, the early-spring light, grey for a long time before it commits. The agent is running. The pager is silent. The notebook on the desk has the start of a quarter long plan written in pencil. The plan begins with the meter and ends with the owner. The founder will read it again tomorrow morning, when the apartment is no longer the kind of quiet that 3am apartments are quiet.

Frequently Asked Questions

How do I know if my meter is real?

A real meter has three properties. First, it is a specific named metric in a specific named tool, not a felt sense that something is pressured. pg_stat_statements showing read queries dominating total time. A CloudWatch latency histogram segmented by customer. Write throughput by table over a defined window. If you cannot point at the dashboard and read the number out loud, the meter is not real yet. Second, the metric has been showing the pressure for at least 30 days. A bad weekend is not a meter; a month of consistent readings is. Third, the metric correlates with a customer-visible or contractually relevant outcome. A primary CPU at 70% is not a problem in itself. A primary CPU at 70% during the windows when a customer's SLA is failing is a problem.
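The second property is the one most easily checked mechanically. A minimal sketch, assuming the meter can be exported as one p99 reading per day. Reading "consistent" as "consecutive days above the limit, ending at the latest reading" is my interpretation rather than a definition from this series, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def sustained_breach_days(daily_p99_ms: Dict[date, float], limit_ms: float) -> int:
    """Count consecutive days, ending at the latest reading, with p99 above the limit.

    A day with no reading breaks the streak: a gap in the meter is not evidence.
    """
    if not daily_p99_ms:
        return 0
    day = max(daily_p99_ms)
    streak = 0
    while day in daily_p99_ms and daily_p99_ms[day] > limit_ms:
        streak += 1
        day -= timedelta(days=1)
    return streak


def meter_is_real(
    daily_p99_ms: Dict[date, float], limit_ms: float, floor_days: int = 30
) -> bool:
    """True when the breach has held for at least `floor_days` consecutive days."""
    return sustained_breach_days(daily_p99_ms, limit_ms) >= floor_days
```

With the numbers from the opening scene, a week of 380ms readings against a 200ms limit gives a streak of 7, which is a debugging problem by this post's standard and not yet a trigger on the meter alone.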

Should I move to a managed multi-region database (Aurora Global Database, Cloud Spanner, CockroachDB) instead of building my own?

My read is that for a team crossing into Stage 5 because of a single customer contract, a managed multi-region product is worth seriously evaluating before committing to a self-built setup. The operational load of running cross-region replication yourself is substantial: monitoring lag, handling failover, managing the routing layer, drilling the runbook. The managed product takes some of that load off the team in exchange for a higher recurring bill and a higher lock-in cost. The evaluation question I would put on the table is whether the team's bottleneck is engineering capacity or infrastructure cost. I have not personally run this evaluation for a live customer-facing system, and the right answer depends on a budget and a team profile this post cannot know.

Is semantic caching ever the right starting point at Stage 5?

Almost never, in the sense that you should never add a semantic cache as the first Stage 5 architecture without measuring the hit rate first. The right starting point is the meter, and the meter is two-sided: a material embedding cost on the bill, joined to a measured hit rate against your specific workload. If both sides show pressure, the cache earns its place. If only the bill shows pressure but the hit rate analysis returns 4%, the cache is the wrong fix and the problem lies elsewhere (likely an embedding model choice, a prompt design issue, or a retrieval strategy revision). My read is that semantic caching is the Stage 5 decision most likely to be added before the measurement has been done, because the promise is intuitive and the "measure first" step is the one that is easiest to defer.
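The hit rate side of the meter can be estimated offline before any cache exists, by replaying a sample of logged query embeddings in order. A minimal sketch in plain Python; the 0.95 similarity threshold and the function names are assumptions for illustration, and the pairwise comparison is quadratic, so it suits a sample of the log rather than the whole table.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def replay_hit_rate(embeddings: List[Sequence[float]], threshold: float = 0.95) -> float:
    """Fraction of queries that an earlier, sufficiently similar query would have served."""
    cached: List[Sequence[float]] = []
    hits = 0
    for e in embeddings:
        if any(cosine(e, c) >= threshold for c in cached):
            hits += 1
        else:
            cached.append(e)  # a miss is what populates the cache
    return hits / len(embeddings) if embeddings else 0.0
```

If the replay returns something near the 4% in the example above, the measurement has already done its job: the bill is real, and the cache is not the answer to it.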

How do I decide if my customer's SLA is a Stage 5 trigger or a renegotiation?

The question I would ask first is whether the contract was signed with full visibility into the architectural implications of the SLA. If both teams treated the SLA clause as boilerplate without modeling whether the current architecture could meet it, the conversation is potentially a renegotiation rather than a Stage 5 trigger. A revised SLA that the architecture can meet from one region is cheaper than a quarter of multi-region work, and the customer may agree to the revision if the architectural cost is named honestly. If the SLA was a hard requirement the customer would not have signed without (typical for enterprise contracts where the latency budget maps to their own downstream SLAs), then the SLA is a Stage 5 trigger. The honest conversation with the customer is the cheapest first step.

What about Aurora, Spanner, CockroachDB? Are they Stage 5 shortcuts?

They are shortcuts in the sense that they shift the operational load of cross-region replication from your team to the provider. They are not shortcuts in the sense that the application work (read routing, write routing, audit work, monitoring, runbook, drill) is still on your team. The three are also not interchangeable. Aurora Global Database keeps writes at a single primary region. Spanner and CockroachDB offer distributed writes with different consistency and cost tradeoffs. The replication model is the first axis of the evaluation. The managed product turns a quarter of work into something between a month and two months, depending on the application's complexity. The cost is the recurring bill and the lock-in, and the bill is not small at the data volumes a multi-region Postgres setup introduces.

Do I really need to hire a contractor for Stage 5?

My read is "seriously consider it" rather than "yes, always." Stage 5 work is specialized, the failure modes are severe, and the founding team's opportunity cost of a quarter on infrastructure is a quarter not spent on the product. A contractor who has run a cross-region migration on a comparable system before brings experience the founding team probably does not have, and the engagement is bounded. The reasons I say "consider it" rather than "yes" are that some founding teams have the depth, the contractor budget is not always available, and the market for this specific work is small enough that finding the right person can itself take a month.

What does "success" look like after Stage 5?

Success looks like a quieter dashboard than the one the founder was reading at 3am. The Singapore customer's p99 is back inside the SLA. The pager has not gone off for the affected region in eight weeks. The audit-specific replica is serving the compliance reads without blocking production writes. The on-call rotation for the replica region is staffed and rested. The runbook has been drilled twice and the second drill went better than the first. The founder is no longer the protagonist of the system's growth. The founder is the steward of a system that is meeting contractual commitments to customers who will never meet the founder, and the system is quiet enough that the founder can spend Tuesday afternoon on a product conversation rather than on a dashboard. That is what success looks like. The dashboard is quiet. The contract is being met.

Code Atelier · NYC
