Agent Engineering · 25 min

When consequence becomes real

Stage 3 is the chapter where the first card gets charged. Three turn ons that mean the difference between an agent doing its job and a customer feeling like their money was respected.

It is a Tuesday in late November. The room is the same room and the desk is the same slab of birch, but the light through the window has gone the color of an unwashed nickel and the lamp has been on since two in the afternoon because the cloud cover came in over the weekend and has not lifted. The agent in the other terminal is the same agent that has been running for almost six months now, since the Friday in May when you typed tenant_id on a line by itself and made a future Thursday's life cheap.

On the second monitor, three things are new. The first is a PDF of a real contract, signed last Thursday by the partner whose UUID has been in the top drawer of the desk since early autumn. The contract has a real ACH on it and the first payment cleared this morning at nine forty seven. The second is a Slack message from the partner, sent eleven minutes ago, that reads "Maya is going to log in tonight, can you also flip the switch for Carlos by Wednesday, he is going to need the full thing." The third is the agent's scheduled jobs panel, with a row at the bottom that says generate_invoice scheduled for two a.m. Wednesday, run for the first time tonight against a paying tenant, against a credit card that has a real human's name on it.

The question on the screen this Tuesday afternoon, and the only question that matters between now and Wednesday morning, is small. What runs today and tomorrow so that the first invoice the system has ever issued against a real card is the kind of invoice nobody has to apologize for?

In the first post I named four axes and a five stage ladder. In the Stage 1 post I named six additive decisions and a list of deferrals each waiting for its specific trigger to arrive. In the Stage 2 post I turned on the first of those deferrals: row level security, connection role hygiene, and the actor_kind discipline in the events table. Five deferrals remained at the end of that Thursday. Three of them light up this week. RBAC, idempotency enforcement, and the outbox. The trigger is the same for all three, and it is the contract that cleared this morning at nine forty seven.

This is a piece about that work. One paying customer with a real card, three people inside their account who are about to need different views of the same data, and an agent that is about to do something that costs money. If Stage 2 was done well, this is roughly a week of focused work for a two person team. If Stage 2 was rushed, the bill arrives now, and it arrives at the same time the customer is asking when Carlos is getting his admin role.

The first piece of work, which is a table small enough to read out loud

The cursor is on a new migration file. You have called it 2026/11/stage_3_user_roles.sql, and the name is the entire intent of the file. The table is short. Five columns, three role names, one unique constraint. It looks like this on the screen when you finish writing it.

CREATE TABLE user_roles (
  id uuid primary key default gen_random_uuid_v7(),
  tenant_id uuid not null,
  user_id uuid not null,
  role text not null check (role in ('admin', 'member', 'viewer')),
  created_at timestamptz not null default now(),
  unique (tenant_id, user_id)
);

Six lines. Three roles. A CHECK constraint that names the entire role vocabulary so no application code can write a fourth value without a schema change happening first. A unique constraint on (tenant_id, user_id) because a user has one role inside one tenant, and the schema is the layer that enforces it instead of the application code remembering to. The Stage 1 decision on the primary key is doing its quiet work three lines down, where gen_random_uuid_v7() would have been bigserial if Friday in May had gone differently.
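To make the two constraints concrete, here is a minimal sketch of what they reject. It assumes SQLite as a stand-in for Postgres so it runs anywhere, simplifies the uuid and timestamptz columns to text, and uses a hypothetical helper named assign_role that is not part of the original code.

```python
import sqlite3

# SQLite stands in for Postgres here; column types are simplified to text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_roles (
        id         text primary key,
        tenant_id  text not null,
        user_id    text not null,
        role       text not null check (role in ('admin', 'member', 'viewer')),
        unique (tenant_id, user_id)
    )
""")

def assign_role(row_id, tenant_id, user_id, role):
    """Hypothetical helper: insert a role row, return True if the schema accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user_roles (id, tenant_id, user_id, role) "
                "VALUES (?, ?, ?, ?)",
                (row_id, tenant_id, user_id, role),
            )
        return True
    except sqlite3.IntegrityError:
        return False

maya_ok = assign_role("r1", "tenant-a", "maya", "member")          # accepted
fourth_role = assign_role("r2", "tenant-a", "carlos", "billing")   # rejected by CHECK
second_row = assign_role("r3", "tenant-a", "maya", "admin")        # rejected by UNIQUE
```

The application never has to remember either rule. A fourth role name or a second role for the same user inside the same tenant fails at the schema, whichever code path attempted it.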

The shape of the table is the shape of the product surface as of this Tuesday afternoon. The founder is admin. Maya the analyst is member. Carlos the ops lead is admin, because the partner's Slack message asked for it explicitly. There is no billing role yet, because the only person inside the tenant who touches the billing UI is the partner herself, who is already admin. There is no auditor role yet, because there is no auditor. There is no owner role yet, because the distinction between admin and owner has not surfaced. Three is what the product surface has earned, and the schema's job is to model the surface that exists rather than the surface that might.

Figure: Stage 3 roles, permissions, and the agent's inherited posture. A three by six permissions matrix for the admin (founder, Carlos), member (Maya, analyst), and viewer (read only seat) roles. Columns: view dashboard, edit own data, run agent action, invite user, see billing, and agent acting on my behalf. The rightmost column is visually separated to show that the agent inherits the calling user's permissions rather than acting as a distinct principal, so it carries the same cells as the human's row. Caption: Roles emerge from product surface contact, not PRD speculation. Three is the minimum that earns its keep.

The application change behind the migration is small and lives in two places. The first is the data access function q from Stage 1, which has been threading tenant_id through every read and write since May. The signature grows by one parameter. Where q(tenant_id, query, params) was the entire contract, the new contract is q(tenant_id, user_id, query, params), and the function looks up the user's role from user_roles at the start of every request and stores it in a session GUC called app.current_user_role. The role is then available to the application code in any check that needs it. The query patterns themselves do not change. The threading of context grows by one field, the same shape the Stage 2 work had when it dropped the sentinel and started carrying real tenant identifiers.
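The wider signature can be sketched as follows. This is an illustration under assumptions, not the Stage 1 function itself: SQLite stands in for Postgres, the connection is passed explicitly, the role travels on a hypothetical RequestContext object instead of the app.current_user_role GUC, and a missing role fails closed.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestContext:
    """Hypothetical carrier for the context q threads through each request."""
    tenant_id: str
    user_id: str
    role: str

def q(conn, tenant_id, user_id, query, params=()):
    """Resolve the caller's role first, then run the query."""
    row = conn.execute(
        "SELECT role FROM user_roles WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    ).fetchone()
    if row is None:
        # Assumption: fail closed instead of falling through to a default role.
        raise PermissionError(f"no role for {user_id!r} in tenant {tenant_id!r}")
    ctx = RequestContext(tenant_id, user_id, row[0])
    # In Postgres this is where the role would be written to
    # app.current_user_role; here it travels on the returned context.
    rows = conn.execute(query, params).fetchall()
    return ctx, rows

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_roles (tenant_id text, user_id text, role text, "
    "unique (tenant_id, user_id))"
)
conn.execute("CREATE TABLE dashboards (tenant_id text, name text)")
conn.execute("INSERT INTO user_roles VALUES ('tenant-a', 'maya', 'member')")
conn.execute(
    "INSERT INTO dashboards VALUES ('tenant-a', 'revenue'), ('tenant-b', 'other')"
)

ctx, rows = q(
    conn, "tenant-a", "maya",
    "SELECT name FROM dashboards WHERE tenant_id = ?", ("tenant-a",),
)
```

The query text itself is unchanged from the Stage 2 shape. Only the context resolved ahead of it has grown by one field.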

The second place the change lives is the agent's own calling context, and this is the judgment call I would name first to a peer founder asking what to look for in their Stage 3 review. The agent is a caller. The tool calls it makes land side effects. The question of what role the agent carries when it makes those calls is the question that decides whether RBAC is doing real work or theatrical work. My read is that the agent should carry the role of the user on whose behalf it is acting, not a tenant wide service account that bypasses the check. If Maya asks the agent to summarize the dashboard, the agent's calling context carries member, and the read is scoped to what member is allowed to see. If Carlos asks the agent to invite a new user, the agent carries admin, and the call is allowed.

The alternative posture, which is the one I would check for first in any agent codebase, is the ambient tenant service account. The agent's session carries a synthetic principal with full access inside the tenant, and the RBAC check passes by default because the synthetic principal is admin on everything. Maya's read flows through the agent and returns rows that Maya herself could not have read directly. The RBAC layer is present in the codebase. The roles table exists. None of it does any work when the agent is involved, because the agent's calling context bypasses the check. This is the failure mode I would write on the wall in the apartment if I were the founder this week.
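The inherited posture can be sketched in a few lines. Everything named here is hypothetical: the CallingContext type, the agent_context_for and authorize helpers, and the permission map, whose cells are assumptions chosen only to match the two examples above (a member can view the dashboard, an admin can invite a user).

```python
from dataclasses import dataclass, replace

# Assumed permission map; the real matrix cells are not given in the text.
PERMISSIONS = {
    "admin": {"view_dashboard", "run_agent_action", "invite_user"},
    "member": {"view_dashboard", "run_agent_action"},
    "viewer": {"view_dashboard"},
}

@dataclass(frozen=True)
class CallingContext:
    tenant_id: str
    user_id: str
    role: str
    via_agent: bool = False

def agent_context_for(user_ctx):
    """The agent copies the caller's role; there is no tenant wide service role."""
    return replace(user_ctx, via_agent=True)

def authorize(ctx, action):
    """Same check for humans and the agent: the role decides, not the caller type."""
    if action not in PERMISSIONS.get(ctx.role, set()):
        raise PermissionError(f"role {ctx.role!r} may not {action}")

maya = CallingContext("tenant-a", "maya", "member")
carlos = CallingContext("tenant-a", "carlos", "admin")

authorize(agent_context_for(maya), "view_dashboard")   # allowed: member can read
authorize(agent_context_for(carlos), "invite_user")    # allowed: admin can invite

try:
    authorize(agent_context_for(maya), "invite_user")  # member via agent: rejected
    maya_invite_allowed = True
except PermissionError:
    maya_invite_allowed = False
```

The ambient service account would correspond to agent_context_for returning a context with role set to admin regardless of the caller, which is the one line worth searching for in a review.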

A short detour into a sibling scene, which is the failure mode of building too soon

Before the next piece of work lands, I want to put a sibling scene on the table the way the previous posts did. Whether the roles table you wrote a moment ago is the right roles table is the most consequential question in this section, and the wrong answer to it is a failure mode I want you inoculated against, the same way the Brooklyn team in the Stage 1 post served as the cautionary scene for the over builder.

Picture a different Tuesday afternoon, a year earlier, in a different apartment. A two person team has read a SaaS playbook the week they shipped their Stage 1 schema. The team built six roles on the first weekend. owner, admin, billing, manager, member, viewer. They wired them into a permissions table that joined users to tenants to roles to capabilities, with twenty four capability flags spread across the surface of a product they had not yet built. They were proud of the table. They took a picture of it on the whiteboard.

Two years pass. The team has shipped. The customers do not use the roles. The customer accounts in production carry users who are all owner or all admin, because nobody on the customer's side ever sat down to figure out which of the six applied to whom, and the cheapest move was to give everyone a senior role. The capability flags have, in the meantime, been bypassed by application code in fourteen distinct places, each written under deadline pressure by a developer who looked at the capability check, could not figure out which flag controlled the surface, and wrote a tenant scoped check instead. The roles table is still there. The application is doing tenant scoped checks in fourteen places and the roles model has become decorative.

In the third year, a customer asks for a real role distinction. An analyst who should see dashboards and not exports. The dashboards and exports flow through three of the fourteen bypass paths. The role they need is not one of the six. It is a seventh role that nobody anticipated, because the customer's mental model of "analyst" does not map to member or viewer, and the team is reluctant to add it because seven felt like too many when six was the design. The team rebuilds. They cut down to two roles, admin and member. The customer's analyst becomes member with application level overrides on the export surface. The migration is a few weeks of work. The customer is patient. The team is embarrassed.

The lesson is the one I would underline in the playbook if I were editing it. Roles emerge from product surface contact, not from PRD speculation. The right roles model is the smallest one that maps cleanly to the product surface as it exists this week. Three roles at Stage 3 is the answer because three is what the founder, Maya, and Carlos need this week. The fourth role, when it comes, will come from a real customer asking a real question that the three role model cannot answer, and the migration will be a column add and a CHECK constraint update, not a rebuild. The schema's job is to lag the product, not to lead it.

Back to your apartment, where the cursor has moved off the roles migration and is now on a file that does not yet exist.

The second piece of work, which is the column that has been waiting since May

The cursor is on a new migration file. 2026/11/stage_3_idempotency.sql. This is the file that turns on the layer Decision 6 in the Stage 1 post deferred: the request_id field that has been sitting on every tool call input schema since the first weekend in May, generated by the agent on every call, accepted by every implementation, and ignored by all of them. Tonight is the night the field starts being read.

The table is short.

CREATE TABLE idempotency_keys (
  key uuid primary key,
  tenant_id uuid not null,
  response jsonb not null,
  created_at timestamptz not null default now(),
  expires_at timestamptz not null default (now() + interval '24 hours')
);
CREATE INDEX ON idempotency_keys (expires_at);

The unique constraint comes for free from the primary key. The TTL column lets the cleanup job find expired rows cheaply, and the index on expires_at is what makes the cleanup query a scan of a narrow range. The cleanup itself is a one line DELETE FROM idempotency_keys WHERE expires_at < now() on a schedule.

The shape of the enforcement at the tool layer matters more than the schema. The agent makes a tool call. The tool wrapper looks up the request_id in idempotency_keys. If the row exists and is not expired, the wrapper returns the cached response and does not invoke the underlying tool. If the row does not exist, the wrapper runs the tool, captures the response, and writes both the key and the response into idempotency_keys in the same transaction as any domain table writes the tool performed. The ON CONFLICT DO NOTHING clause on the insert is the concurrency guard. If two retries of the same request_id arrive at the same instant, both will pass the initial "not in the table" check, both will attempt to insert, and exactly one will succeed. The other will hit the unique constraint and insert zero rows instead of raising an error, which is the wrapper's signal to roll back its own writes, read the committed row, and return its cached response.
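A minimal sketch of that wrapper follows, under stated assumptions. SQLite stands in for Postgres (it accepts the same ON CONFLICT DO NOTHING clause), timestamps are plain unix seconds passed in by the caller, and the names run_idempotent, cached_response, LostRace, and generate_invoice are hypothetical. The DELETE of an expired row for the same key is an addition of this sketch so that a stale row cannot block a fresh insert.

```python
import json
import sqlite3

class LostRace(Exception):
    """Another transaction committed this request_id first."""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE idempotency_keys (
    key text primary key, tenant_id text not null,
    response text not null, expires_at real not null)""")
conn.execute("CREATE TABLE invoices (id text primary key, amount_cents integer not null)")

def cached_response(request_id, tenant_id, now):
    """Return the stored response for a live key, or None."""
    row = conn.execute(
        "SELECT response FROM idempotency_keys "
        "WHERE key = ? AND tenant_id = ? AND expires_at > ?",
        (request_id, tenant_id, now),
    ).fetchone()
    return json.loads(row[0]) if row else None

def run_idempotent(request_id, tenant_id, tool, now, ttl_seconds=24 * 3600):
    hit = cached_response(request_id, tenant_id, now)
    if hit is not None:
        return hit  # a retry: the underlying tool is not invoked again
    try:
        with conn:  # one transaction for the domain writes and the key insert
            conn.execute(
                "DELETE FROM idempotency_keys WHERE key = ? AND expires_at <= ?",
                (request_id, now),
            )
            response = tool(conn)
            cur = conn.execute(
                "INSERT INTO idempotency_keys (key, tenant_id, response, expires_at) "
                "VALUES (?, ?, ?, ?) ON CONFLICT (key) DO NOTHING",
                (request_id, tenant_id, json.dumps(response), now + ttl_seconds),
            )
            if cur.rowcount == 0:
                raise LostRace()  # rolls back this attempt's domain writes
    except LostRace:
        return cached_response(request_id, tenant_id, now)
    return response

calls = []

def generate_invoice(db):
    """Hypothetical tool: one domain write plus a response."""
    calls.append("ran")
    db.execute("INSERT INTO invoices (id, amount_cents) VALUES ('inv-1', 4900)")
    return {"invoice_id": "inv-1", "status": "issued"}

first = run_idempotent("req-1", "tenant-a", generate_invoice, now=1000.0)
second = run_idempotent("req-1", "tenant-a", generate_invoice, now=1060.0)
```

The second call returns the stored response without running generate_invoice again. The rollback in the lost race branch only undoes database writes, so anything the tool did outside the database still needs its own protection.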

The pattern is documented well in Stripe's writeup on idempotency keys, which has been my reference for this layer. Where the Stripe post earns its keep, in my read, is the distinction between idempotency keys as a property of the request and idempotency derived from a hash of the payload. The request bears the key. The key is the truth. If the payload changes between the original call and the retry, the retry is a new request and the key has to be new. The agent's planner owns the request_id lifecycle, which is why the field has been on the input schema since May, and which is why the retry logic has to pass the same request_id it received on the original call rather than generating a new one.

The retry logic is the part that earns the most careful reading this Tuesday. The agent's planner has been generating request_id values on every tool call since May, but the intuitive implementation of the retry path generates a new request_id on every retry. The reasoning is intuitive (a retry is a new attempt) and wrong for Stage 3 (a retry, from the idempotency layer's perspective, is the same logical request as the original call). The change is small in characters. The retry catches the failure of the original call, observes the request_id from the original, and reissues with the same request_id. If the underlying tool already ran (the network blip dropped the response, not the request), the second call returns the cached response. If the tool did not run, the second call runs cleanly.
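The discipline fits in one function. This is a sketch with hypothetical names (call_with_retry, FlakyChargeTool), and it uses uuid4 only because a v7 generator is not available in every Python version. The fake tool remembers responses by request_id, which stands in for the idempotency table.

```python
import uuid

def call_with_retry(tool, payload, max_attempts=3):
    """Generate request_id once, then reuse it on every retry of the same call."""
    request_id = str(uuid.uuid4())  # created before the first attempt, never inside the loop
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(request_id, payload)
        except ConnectionError as exc:
            last_error = exc  # the response may have been lost after the tool ran
    raise last_error

class FlakyChargeTool:
    """Hypothetical tool: charges once per request_id, loses its first response."""

    def __init__(self):
        self.responses = {}
        self.charges = 0
        self.seen_ids = []
        self._drop_next = True

    def __call__(self, request_id, payload):
        self.seen_ids.append(request_id)
        if request_id not in self.responses:
            self.charges += 1
            self.responses[request_id] = {
                "status": "charged",
                "amount_cents": payload["amount_cents"],
            }
        if self._drop_next:
            self._drop_next = False
            raise ConnectionError("response lost in transit")
        return self.responses[request_id]

tool = FlakyChargeTool()
result = call_with_retry(tool, {"amount_cents": 4900})
```

Moving the uuid call inside the loop is the whole bug. With a fresh request_id per attempt, the second attempt looks like a new request and the charge counter reaches two.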

The window length is a product decision more than a technical one. My read is that 24 hours is a defensible default for most tool calls. The window has to be longer than any realistic retry scenario the agent can produce (for synchronous tool calls, rarely more than minutes) and shorter than the time after which a re arrival of the same request_id would be more likely a coincidence than a retry (for UUID v7 values, close to never). Billing operations may warrant a longer window because a customer support flow that triggers a retry could land days after the original call. The variable to tune against is the realistic retry window for your specific tool call.

One piece of this section I want to flag as Stage 3 boundary work. The connection pooler hygiene from the Stage 2 post, where the GUC bleed test confirmed that app.current_tenant does not survive across connection reuse, is now structural rather than checklist. The idempotency table is keyed on request_id, but every read against it also has to respect the tenant_id filter from the row level policy. If the GUC bleeds across connection reuse, an idempotency check for one tenant could read or write the row of another, and the failure is silent because request_id is globally unique. My read on the implementation choice is the same as the Stage 2 read. PgBouncer in transaction mode is the cleanest path, as long as the GUC is set with SET LOCAL inside each transaction, because a session level SET on a pooled connection is exactly the value that bleeds to the next client. Explicit SET LOCAL calls at the start of every transaction also cover the case where the pooler mode is not configurable on your managed Postgres. The Stage 2 GUC bleed test gets promoted from a pre flight check to a permanent CI invariant this week.

The third piece of work, which is the moment two operations become one

The cursor moves to a third migration file. 2026/11/stage_3_outbox.sql. This is the file that turns on the layer that the Stage 2 post named as the third deferral and that has, until tonight, been a debugging story rather than a customer story.

The table looks like this.

CREATE TABLE outbox (
  id uuid primary key default gen_random_uuid_v7(),
  aggregate_id uuid not null,
  event_type text not null,
  payload jsonb not null,
  created_at timestamptz not null default now(),
  processed_at timestamptz
);
CREATE INDEX ON outbox (processed_at) WHERE processed_at IS NULL;

The partial index on unprocessed rows is what the drain worker uses to find work and what keeps the worker's SELECT cheap as the table grows. Processed rows accumulate. They are kept for the same audit reasons the events table keeps its history. The drain worker reads unprocessed rows in batches, dispatches the downstream side effect, and writes processed_at once the side effect confirms.

The shape of the pattern matters more than the schema. The pattern is described well in Chris Richardson's canonical writeup of the transactional outbox, which has been my reference since I first read it. The agent's intent and the downstream side effect are two distinct things. The intent is the commitment the system makes when it processes a request. The side effect is the world changing event the intent represents. The two have to happen atomically from the customer's perspective, but they cannot happen atomically from the system's perspective because the database and the downstream are two different systems with two different reliability surfaces.

The naive flow, the one a codebase naturally defaults to if the outbox is not in scope, is that the agent's response handler writes the domain table row and then calls the downstream side effect. If the network drops between the domain write and the side effect call, the domain write is committed and the side effect never happens. The customer's dashboard shows the action as done. The world does not reflect it. The customer notices days later when the expected email never arrives.

The outbox flow puts both writes inside one Postgres transaction. The response handler writes the domain row and an outbox row in the same transaction. The transaction either commits both or neither. If it commits, the outbox row is the durable commitment that the side effect will run, regardless of what the network does next. A separate worker reads unprocessed outbox rows on a schedule and dispatches the side effect. The dispatch is at least once. If the dispatch fails, the row stays unprocessed and the worker retries on the next cycle. The downstream side effect has to be idempotent, which it is, because the agent carries request_id on every tool call and the downstream tool reads it through the idempotency table from the previous section. The two layers compose. The outbox guarantees the side effect runs at least once. The idempotency layer guarantees it has the effect of running at most once.
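The atomic half of the flow can be sketched like this. SQLite stands in for Postgres, the tables are trimmed to the columns the example needs, and record_invoice, counts, and the event type string are hypothetical names. The crash_before_commit flag exists only to simulate a failure inside the transaction.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id text primary key, amount_cents integer not null);
    CREATE TABLE outbox (
        id           text primary key,
        aggregate_id text not null,
        event_type   text not null,
        payload      text not null,
        processed_at real
    );
""")

def record_invoice(invoice_id, amount_cents, crash_before_commit=False):
    """Write the domain row and the outbox row in one transaction."""
    with conn:  # commits both inserts, or rolls both back on any exception
        conn.execute(
            "INSERT INTO invoices (id, amount_cents) VALUES (?, ?)",
            (invoice_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), invoice_id, "invoice.charge_requested",
             json.dumps({"amount_cents": amount_cents})),
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash before commit")

def counts():
    """Return (invoice rows, unprocessed outbox rows)."""
    invoices = conn.execute("SELECT count(*) FROM invoices").fetchone()[0]
    pending = conn.execute(
        "SELECT count(*) FROM outbox WHERE processed_at IS NULL"
    ).fetchone()[0]
    return invoices, pending

try:
    record_invoice("inv-1", 4900, crash_before_commit=True)
except RuntimeError:
    pass
after_crash = counts()    # neither row survives the failed transaction

record_invoice("inv-2", 4900)
after_commit = counts()   # both rows exist, and the outbox row is pending
```

There is no state in which the invoice exists without a pending outbox row, which is the property the naive flow cannot offer.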

Figure: The outbox, one Postgres transaction and a worker that drains it. Upper lane (01, atomic write): the LLM response handler inserts the domain row and the outbox row inside a single Postgres transaction, which commits both or neither. Middle lane (02, drain worker, at least once): a pg_cron worker polls every 30s for rows where processed_at IS NULL, dispatches the side effect (stripe.charges.create(), postmark.email.send(), webhook.dispatch()), and on success runs UPDATE processed_at = now(). Muted lower lane (outbox skipped, the cautionary path): the domain row commits, the dashboard shows done, the network drops for 9 seconds, the response is lost, and the side effect never fires. Caption: Without the outbox, the agent's intent and the customer's reality drift apart. The grey path is the failure mode.

There is a tempting alternative I want to address. The team is using a managed queue service (SQS, Pub/Sub, Cloud Tasks) and puts the side effect on the queue inside the response handler. The argument is that the queue is durable and dispatches reliably, so the outbox is redundant. The argument is correct if and only if the queue write is inside the same Postgres transaction as the domain write. For most managed queues, it is not. The Postgres transaction commits, and the queue write happens as a separate operation. If the application crashes or the network drops between the commit and the queue write, the domain row is committed and the queue never sees the request. The failure mode is identical to the no outbox case. The outbox lets you defer the message broker decision while still getting transactional guarantees at the database boundary.

The drain worker implementation at Stage 3 is the simplest thing that runs reliably. A pg_cron job that runs every thirty seconds, picks up a batch, dispatches, and marks processed_at is enough for one paying customer's volume. A small background process is the slightly more flexible option if your side effects need careful backoff. The wrong choice at Stage 3, in my read, is installing a distributed message queue to drain the outbox. That solves a scale problem you do not have yet and adds operational complexity you have not staffed for. The outbox is designed to defer the message broker decision. Do not undo the deferral by installing the broker the week the table goes live.
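For the small background process variant, one drain cycle might look like the sketch below. It is not the pg_cron job itself. SQLite stands in for Postgres, and drain_once, flaky_dispatch, and the event names are hypothetical. A real loop would call drain_once on a timer and pass the outbox row id downstream as the idempotency key.

```python
import json
import sqlite3

def drain_once(conn, dispatch, now, batch_size=10):
    """Dispatch one batch of unprocessed rows; return how many were marked done."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed_at IS NULL ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    processed = 0
    for row_id, event_type, payload in rows:
        try:
            dispatch(row_id, event_type, json.loads(payload))
        except Exception:
            continue  # row stays unprocessed, so the next cycle retries it
        with conn:
            conn.execute(
                "UPDATE outbox SET processed_at = ? WHERE id = ?", (now, row_id)
            )
        processed += 1
    return processed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id text primary key, event_type text not null, "
    "payload text not null, created_at real not null, processed_at real)"
)
conn.executemany(
    "INSERT INTO outbox (id, event_type, payload, created_at) VALUES (?, ?, ?, ?)",
    [
        ("evt-1", "email.send", json.dumps({"to": "maya"}), 1.0),
        ("evt-2", "charge.create", json.dumps({"amount_cents": 4900}), 2.0),
    ],
)
conn.commit()

attempts = []

def flaky_dispatch(row_id, event_type, payload):
    """Hypothetical downstream call that times out once for evt-2."""
    attempts.append(row_id)
    if row_id == "evt-2" and attempts.count("evt-2") == 1:
        raise ConnectionError("downstream timed out")

first_cycle = drain_once(conn, flaky_dispatch, now=30.0)
second_cycle = drain_once(conn, flaky_dispatch, now=60.0)
third_cycle = drain_once(conn, flaky_dispatch, now=90.0)
```

The second event is attempted twice, which is what at least once means in practice. That is why the downstream call has to be idempotent.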

A second detour, which is the Tuesday Carlos got billed twice

I owe you the second sibling scene before the thesis lands, because the first sibling scene was about a roles model built too soon, and the symmetric failure is the same set of turn ons shipped without the layers that make them coherent.

Picture a different founder on a different Tuesday in late autumn. Single founder. Shipped Stage 1 well. Shipped Stage 2 well. Has a real paying customer, a contract that cleared this morning, and three people inside the customer's account. Builds the roles table. Wires the RBAC checks. Confirms the agent inherits the caller's role. Watches Maya log in and see what member should see, watches Carlos log in and see what admin should see, sends a screenshot to the partner. The partner is happy.

The founder does not ship idempotency enforcement. The reasoning is reasonable. The request_id is on every tool signature already. The retry logic has been working fine for six months. Nobody has reported a duplicate side effect. The founder is on a tight schedule and the layer can wait a week. The founder also does not ship the outbox. The reasoning is also reasonable. The response handler writes the domain row and then calls the downstream tool. The two operations are close together. The network has been stable. The outbox can wait a week.

Tuesday evening, eleven forty seven p.m. The agent's scheduled generate_invoice job kicks off for Carlos's account. The billing tool calls the payment processor. The processor processes the charge against Carlos's company card and returns a success response. At eleven forty seven and fourteen seconds, the network connection between the agent host and the database drops for nine seconds. The retry logic, written to be robust, catches the failure of the domain write that should have followed the successful charge. The retry does what it has been doing for six months. It retries the tool call. The billing tool runs again. The processor sees the call, sees a newly generated transaction ID, and processes the charge a second time. Stripe's own idempotency catches the second charge if and only if the second call carries the same idempotency key. The billing tool was written without idempotency enforcement on the agent's side, so the second call generated a new key, and the second charge goes through.

Carlos's company card is charged twice. Carlos is asleep. The first charge generates an invoice row in the customer dashboard. The second charge generates a second invoice row, but the second invoice's processing logic marks the first invoice as paid and zeroes the balance, then registers the second charge as a credit, which zeroes the second invoice too. Carlos's dashboard, when he checks at six in the morning, shows two paid invoices and a current balance of zero dollars. The bank account shows two debits for the same amount. The customer's support ticket arrives at seven fifteen. The next four hours are the worst part of the founder's week. The customer's churn email arrives at three in the afternoon.

Figure: Idempotency, with and without, drawn across two lanes. Upper lane (with idempotency, request_id is the key): 01 the request arrives, 02 the wrapper checks idempotency_keys and returns the cached response on a hit, 03 on a miss it executes and records with ON CONFLICT DO NOTHING, and the response is returned once. After a network blip, the retry's second call hits the cache. Lower lane (without idempotency, Carlos, 11:47 PM): the request arrives and executes a charge, a network blip triggers a retry with a new key, the charge executes again, and the card is charged twice. Caption: The retry that reuses the original request_id is the discipline that turns a double charge into a cached response.

The lesson is the one I would write on the wall if I were the founder, in the same handwriting as the previous detour's, slightly larger. The idempotency layer and the outbox layer are not two separate concerns. They are the same concern at two different layers, and skipping either one leaves a hole the other cannot close. The idempotency layer catches the retry at the request boundary. The outbox catches the atomicity gap between the domain write and the side effect. Together they make the customer's experience of the agent's action match what the customer thinks happened. Apart, the customer's experience drifts from the agent's intent in ways that do not have happy endings.

Back to your apartment, where the laptop has been quiet for the last five minutes because you have been thinking about the schedule for the rest of this week.

The thesis, which lands in one sentence and earns the seven that follow

You step back from the migration files. There are three of them now, sitting in 2026/11/ next to the Stage 1 files from May and the Stage 2 files from September. You write down on the back of an envelope the shape of the work that has landed.

One roles table with three values and a check constraint that names the entire vocabulary. One idempotency table with a primary key on the request_id that has been on every tool call since May. One outbox table with a partial index on unprocessed rows, drained by a small worker that dispatches side effects with at least once semantics on top of the idempotency layer's at most once effect. Three turn ons.

That is the entire surface area of Stage 3 for a team that shipped Stage 2 well. The schema grows by three small tables. The data access function q grows by one parameter. The agent's tool wrapper grows by an idempotency check at the top and a transactional outbox write at the bottom. The retry logic grows by the discipline of reusing the original request_id instead of generating a new one. The whole thing is a week of focused work for a two person team.

The thesis is what I have been building toward since the first paragraph. Stage 3 is the chapter where consequence becomes real. Stage 1 you wrote code. Stage 2 a friend logged in. Stage 3 someone's card got charged. The three turn ons are not features. They are the minimum amount of structure that exists between an agent doing its job and a customer feeling like their money was respected. The roles table is the answer to who sees what when there are three people inside one account. The idempotency layer is the answer to what happens when a request runs twice. The outbox is the answer to what happens when the intent and the side effect cannot be atomic on their own. All three are dormant until consequence arrives. All three are required the day it does.

Stage 3 is three turn ons, and they are the structure between an agent doing its job and a customer feeling like their money was respected.

If Stage 2 was rushed, the bill arrives now. The roles model has to be designed against a product surface that already has three users in it, in days not weeks. The idempotency layer cannot be added cleanly without the request_id field already on every tool signature, and adding it during a paying customer's onboarding produces a release freeze on a Wednesday afternoon. The outbox cannot be added without restructuring the response handler to write both rows in one transaction, and that work, mid onboarding, produces an apologetic email to the partner. The transition is always more expensive than the debt was.

What does not light up at Stage 3, and why each absence is a decision

There is a list, and it is roughly half the point of this post, because the discipline of refusing to do things that are not on the small set is half of what makes Stage 3 finishable in a week.

Full audit hardening stays off. The events table is still a regular Postgres table with the actor_kind discipline writing real values. No hash chain yet, no append only constraint, no actor_kind enum, no non repudiable signing. The trigger is the first regulated customer signing and an auditor asking for reconstructability of every action the agent has taken on their behalf. Until then, the events table has been collecting organic history for almost six months and will continue to. The Stage 4 work, when it lands, is additive on top of that history.

Secret separation stays off. The agent still reads its API keys from environment variables. No dereference layer, no separation between the agent's secret surface and the founder's, no scoped credentials per tenant. Same Stage 4 trigger as the audit hardening. The first regulated customer signs and asks how API keys and PII flow through the agent's context. Until then, secrets live where they have always lived and the paying customer does not see them.

Multi region replication, sharding, and read replicas stay off. The Stage 5 triggers are measured pressure on a named bottleneck on a real workload. The drain worker's throughput, the idempotency table's growth rate, the outbox processing latency are all things you start measuring this week, but measuring is not provisioning. The replica earns its keep when the meter says so.

GDPR tombstones stay off. The tombstone pattern depends on the append only constraint, which is Stage 4 work. Semantic caches and token budget optimization stay off. The agent's retrieval cost is what it is, and measuring it honestly takes more than one paying customer. Stage 5 triggers, all of them.

Each absence has a trigger. The trigger is the next customer type, the next consequence shape, the next measured bottleneck. None of the triggers is "the team had bandwidth this sprint." A system without them is early, not unfinished, and building any of them before its trigger arrives is premature.

Figure: What lights up at Stage 3, and what holds. The Stage 2 deferral heatmap with three more cells solidified at Stage 3. Columns are the stages: 02 Co-builders (first NDA'd operator), 03 Paying (RBAC + idempotency + outbox live), 04 Regulated (SOC2 / HIPAA ask lands), 05 Multi-agent (concurrent at consequence), and 05+ Almost never (documented requirement). Rows: row level security policies; roles table and RBAC; idempotency key enforcement; outbox pattern; materialized views, replicas, and cache; agent secret separation; actor_kind discipline with real values; GDPR tombstones and append only; multi region, failover, and sharding; event sourcing, if ever. Legend: dormant, lights up, required at this stage. Caption: Stage 3 is the first paying customer. Three cells turn solid; Stage 2's RLS and actor_kind discipline still enforce. Five cells lit, five deferred. The trigger for each remaining cell is the next consequence shape, named per row.

Wednesday morning, when the work earns the rest of the week

It is now Wednesday morning. The light is the same November grey it was on Tuesday afternoon, but the lamp is off because the morning grey has at least enough brightness to read by. The three migration files in 2026/11/ were committed last night and merged at eleven thirty. The deploy went out at midnight. The drain worker started its first cycle at twelve oh three. The first outbox row written by a real action landed at twelve fifteen, was picked up by the worker thirty seconds later, dispatched cleanly, and marked processed_at at twelve fifteen and thirty one seconds. You watched the row's lifecycle in the database because you could not sleep until the first outbox row had a processed_at value on it.

What runs this morning is the pre flight. Eight checks, in order. The point of the rhythm is not the checklist as an artifact. It is the half hour between the second cup of coffee and the first meeting.

First, the roles assignment sweep. SELECT u.id FROM users u LEFT JOIN user_roles r ON u.id = r.user_id AND r.tenant_id = u.tenant_id WHERE r.role IS NULL. The expected result is zero rows. Any user without a role assignment falls through to whatever the application code's default is for the missing case. The result is zero rows. Maya has a member row. Carlos has an admin row. The partner has an admin row from the migration's seed insert.
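A minimal sketch of the sweep as a repeatable check, assuming a two table shape for users and user_roles that the post does not spell out. It uses Python's sqlite3 as a stand-in for Postgres so it runs anywhere; the query itself is the one quoted above.

```python
import sqlite3

# The sweep query from the check, unchanged.
SWEEP = """
SELECT u.id
FROM users u
LEFT JOIN user_roles r
  ON u.id = r.user_id AND r.tenant_id = u.tenant_id
WHERE r.role IS NULL
"""

def users_without_role(conn):
    """Return ids of users that have no role row inside their own tenant."""
    return [row[0] for row in conn.execute(SWEEP)]

# Hypothetical schema and seed data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL);
CREATE TABLE user_roles (
    user_id TEXT NOT NULL,
    tenant_id TEXT NOT NULL,
    role TEXT NOT NULL,
    PRIMARY KEY (user_id, tenant_id)
);
INSERT INTO users VALUES ('maya', 't1'), ('carlos', 't1'), ('partner', 't1');
INSERT INTO user_roles VALUES
    ('maya', 't1', 'member'), ('carlos', 't1', 'admin'), ('partner', 't1', 'admin');
""")
clean = users_without_role(conn)

# A user whose only role row belongs to a different tenant must be flagged too.
conn.execute("INSERT INTO users VALUES ('stray', 't1')")
conn.execute("INSERT INTO user_roles VALUES ('stray', 't2', 'admin')")
flagged = users_without_role(conn)
```

The join on tenant_id is the part worth keeping: a role row in another tenant does not count as an assignment.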

Second, the agent session role check. You run a test session where the agent makes a tool call on behalf of Maya, and you confirm the calling context carries member and not admin. A write attempt that would require admin is rejected, the planner handles the rejection cleanly, and the response to Maya names the role boundary.
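One way this check could look in application code, as a sketch under stated assumptions: SessionContext, require_role, invite_user, and the two role ordering are hypothetical names for illustration, not the post's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical two-role ordering; a real roles table may hold more.
ROLE_RANK = {"member": 1, "admin": 2}

@dataclass(frozen=True)
class SessionContext:
    tenant_id: str
    user_id: str
    role: str                  # role of the delegating user, not a service account
    actor_kind: str = "agent"

class RoleBoundaryError(PermissionError):
    """Raised when the calling context lacks the role an action needs."""

def require_role(ctx, needed):
    if ROLE_RANK[ctx.role] < ROLE_RANK[needed]:
        raise RoleBoundaryError(
            f"this action needs the {needed} role; the current session is {ctx.role}"
        )

def invite_user(ctx, email):
    """An admin-only write, standing in for any privileged tool."""
    require_role(ctx, "admin")
    return f"invited {email}"

def agent_tool_call(ctx, email):
    """The planner's side: handle the rejection and name the boundary."""
    try:
        return invite_user(ctx, email)
    except RoleBoundaryError as exc:
        return f"I can't do that: {exc}"

maya = SessionContext(tenant_id="t1", user_id="maya", role="member")
carlos = SessionContext(tenant_id="t1", user_id="carlos", role="admin")
maya_reply = agent_tool_call(maya, "new@example.com")
carlos_reply = agent_tool_call(carlos, "new@example.com")
```

The agent never holds a role of its own here. It carries whatever the delegating user's session carries.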

Third, the idempotency layer existence check. You confirm the primary key on key, the expires_at default of twenty four hours, and the index. You run a test tool call twice with the same request_id. The second call returns the cached response without invoking the tool. The events table has one new row for the first call and zero for the second.
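The double call test can be sketched as follows. The table name idempotency_keys, the response column, and the helper call_tool_idempotently are assumptions for illustration, and sqlite3 stands in for Postgres. This version checks and then writes, so it only covers sequential retries; the race between two simultaneous callers is a separate concern.

```python
import json
import sqlite3

# Hypothetical schema: primary key on key, expires_at defaulting to 24 hours out.
SCHEMA = """
CREATE TABLE idempotency_keys (
    key TEXT PRIMARY KEY,
    response TEXT NOT NULL,
    expires_at TEXT NOT NULL DEFAULT (datetime('now', '+1 day'))
);
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    request_id TEXT NOT NULL,
    actor_kind TEXT NOT NULL
);
"""

def call_tool_idempotently(conn, request_id, tool, payload):
    """Return the cached response for request_id, or run the tool once and cache it."""
    row = conn.execute(
        "SELECT response FROM idempotency_keys "
        "WHERE key = ? AND expires_at > datetime('now')",
        (request_id,),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])      # cache hit: the tool is not invoked
    response = tool(payload)
    with conn:                          # key and event commit together
        conn.execute(
            "INSERT INTO idempotency_keys (key, response) VALUES (?, ?)",
            (request_id, json.dumps(response)),
        )
        conn.execute(
            "INSERT INTO events (request_id, actor_kind) VALUES (?, 'agent')",
            (request_id,),
        )
    return response

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tool_runs = []

def send_report(payload):
    tool_runs.append(payload)
    return {"status": "sent", "to": payload["to"]}

first = call_tool_idempotently(conn, "req-001", send_report, {"to": "maya"})
second = call_tool_idempotently(conn, "req-001", send_report, {"to": "maya"})
event_rows = conn.execute(
    "SELECT COUNT(*) FROM events WHERE request_id = 'req-001'"
).fetchone()[0]
```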

Fourth, the concurrency guard smoke test. You open two database sessions in parallel and execute the same idempotency insert with the same request_id and ON CONFLICT DO NOTHING. Exactly one insert returns one row affected. The other returns zero. The events table confirms the underlying tool ran exactly once.
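A rough sketch of the guard, with two connections to one SQLite file standing in for two Postgres sessions. The table and the try_claim helper are hypothetical. The sessions here run one after the other; in Postgres, a truly concurrent second insert on the same key waits for the first transaction to finish and then reports zero rows if the first committed.

```python
import os
import sqlite3
import tempfile

CLAIM = "INSERT INTO idempotency_keys (key) VALUES (?) ON CONFLICT (key) DO NOTHING"

def try_claim(conn, request_id):
    """Return True only for the session whose insert actually created the row."""
    with conn:
        cur = conn.execute(CLAIM, (request_id,))
        return cur.rowcount == 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "guard.db")
    session_a = sqlite3.connect(path)
    session_b = sqlite3.connect(path)
    session_a.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY)")
    session_a.commit()

    # Both sessions present the same request_id; only one may win the claim.
    results = [try_claim(session_a, "req-123"), try_claim(session_b, "req-123")]
    tool_runs = sum(results)            # the tool runs only for a winning claim

    session_a.close()
    session_b.close()
```

The design point is that the claim is decided by the primary key, not by application code reading first and writing second.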

Fifth, the outbox drain worker check. You insert a test row into outbox with processed_at left NULL and wait for the next cycle. The worker picks it up, dispatches the test side effect, and writes processed_at.
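For orientation, one drain cycle might look like the sketch below. The outbox columns other than processed_at, and the drain_once and dispatch names, are assumptions; sqlite3 stands in for Postgres. A single worker is assumed. With several workers on Postgres, the pending rows would typically be claimed with SELECT ... FOR UPDATE SKIP LOCKED.

```python
import sqlite3

def drain_once(conn, dispatch):
    """One worker cycle: dispatch each unprocessed outbox row, then stamp it."""
    pending = conn.execute(
        "SELECT id, kind, payload FROM outbox "
        "WHERE processed_at IS NULL ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, kind, payload in pending:
        # If dispatch raises, the row stays pending and is retried next cycle,
        # so the side effect itself has to tolerate being attempted again.
        dispatch(kind, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET processed_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
        done += 1
    return done

# Hypothetical outbox shape, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    payload TEXT NOT NULL,
    processed_at TEXT
);
INSERT INTO outbox (kind, payload) VALUES ('test_side_effect', 'hello');
""")

sent = []
first_cycle = drain_once(conn, lambda kind, payload: sent.append((kind, payload)))
second_cycle = drain_once(conn, lambda kind, payload: sent.append((kind, payload)))
still_pending = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE processed_at IS NULL"
).fetchone()[0]
```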

Sixth, the outbox transactionality test. I would wire this into CI as a permanent regression, because the failure creeps back through application changes that cross a transaction boundary without the developer realizing. You write a test that, inside one transaction, writes a domain row and an outbox row, then rolls back. Both rows should be absent after the rollback. If the domain row is present and the outbox row is absent, the writes were not in the same transaction and production will see a committed domain write whose side effect never runs.
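A minimal version of that regression, assuming a hypothetical invoices table as the domain row and a create_invoice helper that writes both rows. It runs against sqlite3 so it can live in any CI job; a Postgres version would have the same shape.

```python
import sqlite3

SCHEMA = """
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    payload TEXT NOT NULL,
    processed_at TEXT
);
"""

def create_invoice(conn, tenant_id, amount_cents, crash_before_commit=False):
    """Write the domain row and its outbox row inside one transaction."""
    with conn:  # commits on success, rolls back if the body raises
        cur = conn.execute(
            "INSERT INTO invoices (tenant_id, amount_cents) VALUES (?, ?)",
            (tenant_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (kind, payload) VALUES ('charge_card', ?)",
            (str(cur.lastrowid),),
        )
        if crash_before_commit:
            raise RuntimeError("simulated failure before commit")

def counts(conn):
    invoices = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    outbox = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return invoices, outbox

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

try:
    create_invoice(conn, "t1", 4900, crash_before_commit=True)
except RuntimeError:
    pass
after_rollback = counts(conn)   # both rows must be absent

create_invoice(conn, "t1", 4900)
after_commit = counts(conn)     # both rows must be present
```

If a refactor ever moves the outbox insert outside the transaction, the rollback case leaves the two counts unequal and the test fails.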

Seventh, the GUC bleed test promoted to CI. The test from Stage 2 (open connection, set both GUCs, run query, close connection, open second connection, confirm both GUCs are null) now runs on every deploy. The CI configuration fails the build if either survives. The Stage 2 pre flight check is now an invariant.

Eighth, the events table actor_kind audit. SELECT DISTINCT actor_kind FROM events WHERE tenant_id = '${paying_tenant_uuid}'. The expected result is two values, user and agent, with no system placeholder leaking through from the Stage 1 default. Carlos's account has been live for nineteen hours by the time you run this. The result is two rows. The Stage 2 discipline held under production traffic, which is what this check is really asking. It is also the bridge to Stage 4. If system rows show up in the result, the audit work that lights up at Stage 4 will be forensic instead of additive, and that is a different post for a different week.

The whole rhythm is roughly twenty minutes. The first cup of coffee covered the first two checks. The second cup covered the idempotency and concurrency tests. The outbox tests, the CI confirmation, and the actor_kind sweep ran while a third cup sat untouched. The first meeting on your calendar starts in nine minutes. The pre flight passed. The first invoice ran at two a.m. against Carlos's account. The card was charged exactly once. The outbox processed it. The idempotency layer would have prevented a retry double charge if the network had blipped. The RBAC layer let Maya see exactly what she needed to see when she logged in at nine oh four.

A note for regulated readers

If you are building for regulated customers from day one (HIPAA, PCI, GLBA, SOC 2 audit on the horizon), Stage 3 is still upstream of your real ask, and most of this post sits beneath the floor of what your customer's first contract will require. The RBAC work in this post is the second AC family control to light up in the NIST SP 800-53 progression, after tenant isolation, which lit up at Stage 2. The audit hardening, the secret separation, and the full access control surface that the rest of the AC family and the AU family require are Stage 4 work, additive on top of what this Tuesday and Wednesday built. Nothing in this post is legal or compliance advice, and the regulatory reference is for orientation, not interpretation.

Wednesday evening, with the light already gone

It is now Wednesday evening. The November light has already gone. The lamp is on. The apartment is quiet because the partner's check in earlier said that everything was working. The agent has been running with the Stage 3 posture for slightly less than twenty hours. The roles table has three rows. The idempotency table has forty seven rows from the day's tool calls. The outbox table has one hundred and twelve rows, all marked processed_at. The drain worker is on its forty third cycle since midnight.

You open the customer dashboard one more time. Maya's session this morning shows as a member role calling the agent and reading three dashboards. Carlos's session shows as an admin calling the agent and inviting two new users. The first invoice of the system's life shows on the partner's billing page as paid, processed, and reconciled. The Stripe dashboard shows one charge against Carlos's company card, not two. The events table has the corresponding rows, each carrying the correct actor_kind, the correct tenant_id, and the correct foreign key into the outbox row that committed the side effect.

You close the laptop. You walk to the window. The November light has already gone, the way it does in late autumn. The street is dim. The building across the way has lights on in three windows, all of them yellow. You stand at the window for a moment without thinking anything in particular. The agent is running. The first paying customer's first invoice has been paid. The infrastructure that exists between the agent doing its job and the customer feeling like their money was respected is now real, and it is small enough that you can describe the whole of it on the back of an envelope.

The next architect of this system will open the migration files in 2026/11/ and the shape will read as deliberate, because it was. The roles table will read as small. The idempotency table and the outbox will read as the two halves of one concern at two layers. The Stage 4 conversation, when it comes, will land on top of what this Tuesday and Wednesday built.

If you are at the Stage 2 to Stage 3 boundary and any of this resonates, I would welcome a conversation. The contact form at the top of the site goes to my inbox.

Up next: Stage 4, when the first regulated customer signs and the audit conversation changes shape.

Frequently Asked Questions

I shipped Stage 2 but I skipped Decision 6's `request_id`. How big is the Stage 3 migration?

The schema change is not the hard part. Adding `request_id` to every tool call input schema is one line per tool. The hard part is the coordination across every caller that generates tool calls: the planner, the retry logic, the test harness, the admin panels. My honest read is that this is a few days of careful work, not a nine day migration, but the careful part is the retry path. If your retry path generates a new `request_id` on retry, the entire idempotency layer is defeated and the failure mode looks indistinguishable from not having the layer at all. The check to run after the migration is to retry a tool call deliberately and confirm the cache hit on the second attempt.
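The retry path is small enough to sketch. Everything named here (call_with_retry, TransientError, flaky_charge, the in-memory cache) is hypothetical and only illustrates the rule that the request id is minted once, outside the retry loop. The sketch uses uuid4 because it is available in every Python version; the id scheme is a separate choice.

```python
import uuid

class TransientError(Exception):
    """A failure worth retrying, such as a dropped response."""

def call_with_retry(tool, payload, attempts=3):
    request_id = str(uuid.uuid4())   # generated once, never inside the loop
    last_error = None
    for _ in range(attempts):
        try:
            return tool(request_id, payload)
        except TransientError as exc:
            last_error = exc
    raise last_error

# A stand-in tool with an idempotency cache and one simulated network blip.
cache = {}
charges = []
seen_ids = []
state = {"blip": True}

def flaky_charge(request_id, payload):
    seen_ids.append(request_id)
    if request_id not in cache:
        charges.append(payload["amount_cents"])
        cache[request_id] = {"status": "charged", "amount_cents": payload["amount_cents"]}
    if state["blip"]:
        state["blip"] = False
        raise TransientError("response lost after the charge went through")
    return cache[request_id]

result = call_with_retry(flaky_charge, {"amount_cents": 4900})
```

Moving the uuid call inside the loop would make each attempt look like a new request, and the second attempt would charge again.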

Should the agent have its own role in the RBAC system?

My read is no. The agent is a caller and should carry the role of the user on whose behalf it is acting, not a global service account. The failure mode of an ambient tenant service account is that the agent can read and write things no individual user in the tenant could, and the RBAC layer is decorative from the moment the agent is involved. If the agent's session needs to act across multiple users (a batch job, a scheduled scan), the scope of the cross user action should be explicit in the session context and recorded in the events table with both the agent and the delegating user named as actors.

What is the right outbox worker implementation at Stage 3?

The simplest thing that runs reliably. `pg_cron` with a short interval is the least infrastructure at Stage 3 scale, because the worker runs inside the same Postgres instance the outbox table lives in. A small background process is the more flexible option if your side effects need careful backoff. The wrong choice, in my read, is a distributed message queue or stream processor. The outbox pattern was designed to let you defer the message broker decision. Do not reverse that by installing the broker the week the table goes live.

Do I need the outbox if I am using a managed queue service already?

Yes, unless the queue write is inside the same Postgres transaction as the domain write. For most managed queues, it is not. The Postgres transaction commits and then the queue write happens as a separate operation. If the application crashes between the commit and the queue write, the domain row is committed and the queue never sees the request. The failure is identical to the no outbox case. The outbox row in your own database is the layer that makes the two writes atomic at the boundary that matters.

How long should the idempotency window be?

This is a product decision. The window has to be longer than any realistic retry scenario the agent can produce (for synchronous tool calls, rarely more than minutes) and shorter than the time after which a re arrival of the same `request_id` would be more likely a coincidence than a retry (for UUID v7, close to never). A defensible default for most Stage 3 agent actions is twenty four hours. Billing operations may warrant a longer window because a customer support flow that triggers a retry could land days after the original call. The variable to tune against is the realistic retry window for your specific tool call.

What happens to the events table at Stage 3?

Nothing structurally. The events table from Stage 1 is unchanged. The `actor_kind` discipline from Stage 2 is still writing real values. The Stage 3 addition is that the outbox worker's dispatched side effects also write events, so the audit record includes both the intent (the outbox row creation) and the execution (the downstream confirmation). The events table and the outbox table serve different purposes. Conflating them is the design mistake the Stage 3 boundary invites. The events table records what happened. The outbox records intent plus a commitment that the side effect will run exactly once.

Do I need RLS changes at Stage 3?

Not to the policies themselves. The Stage 2 RLS policies are keyed on `tenant_id` and they continue to enforce tenant isolation. RBAC is a separate enforcement layer that lives above RLS, checking `user_roles` before the query executes. Some teams want to encode RBAC inside RLS by adding a role check to the policy expression. My read is that this is worth doing at Stage 4 and not before. The policy expression becomes complex enough to require careful testing and per role debugging, and at Stage 3 the number of roles and tables is small enough that an application layer check is readable without the abstraction. The Postgres RLS docs (https://www.postgresql.org/docs/current/ddl-rowsecurity.html) are the canonical reading if you want to consider the encoded approach.

Code Atelier · NYC
