When a second operator changes everything

Stage 2 is one migration plus three turn ons. If Stage 1 was done well it is a half day of work, and if it was not, the bill arrives the morning a design partner logs in.

It is a Thursday in early autumn. The same room, the same slab of birch desk, but the light through the window has gone narrower and the lamp is on at four in the afternoon instead of nine at night. The agent has been live for almost three months. The events table has roughly eighty thousand rows in it, written organically across an early summer of test traffic and the two of you using the product on your own work. The schema on the second monitor is the same one you drew last Tuesday in May, with the six pencil colored additions now living in real Postgres and the migration files committed under 2026/05/ in the repository.

On the second monitor, two things are new. The first is a PDF of a mutual NDA, signed on Tuesday by your design partner and her co founder, countersigned by you on Wednesday morning. The second is a sticky note pressed to the bezel of the monitor, the kind that curls a little at the corner, with a UUID written on it in blue ballpoint. 0193a8c2-7e44-7c40-9f1d-5c8b4e3d2a91. The partner's tenant UUID. You generated it this morning. You wrote it down on paper because you wanted to look at it before you typed it into anything.

The partner logs in tomorrow morning at nine. She has the credentials your co founder issued from a new admin panel that took the two of you a weekend to build. Her tenant identifier is the UUID on the sticky note. Her users will be the second and third humans, after you and your co founder, to write to the system. Her data, when it lands, will arrive in tables that have only ever held your data, and the tables themselves will not know they are about to start carrying two tenants instead of one.

The question on the screen tonight, and the only question that matters between now and nine tomorrow morning, is small. What runs Thursday afternoon so that Friday morning is uneventful?

In the first post in this series I named four axes (tenant isolation, RBAC, agent secret separation, audit) and a five stage operational ladder, and tenant isolation was the first axis to light up. In the previous post I named six Stage 1 decisions you made before the partner existed, and a list of deferrals that each waited for a specific trigger. The first deferral, row level security, was waiting for a second tenant to enforce against. The second tenant is logging in tomorrow morning. The trigger has arrived.

This is a piece about the work that turns that first deferral on. One migration plus three turn ons. If Stage 1 was done well, this is half a day, maybe a full day if you have never written a row level policy before and want to think carefully about the connection roles. If Stage 1 was rushed, the bill arrives now. The transition is always more expensive than the debt was, and the schedule you have between Thursday afternoon and Friday morning will not absorb the difference.

The first piece of work, which is the migration that does not change the schema

The cursor is on the migration file. You have called it 2026/05/stage_2_sentinel_replacement.sql because the name is the entire point. The Stage 1 default value on every tenant_id column across the schema is the constant sentinel UUID 00000000-0000-0000-0000-000000000001. That sentinel has been written into roughly eighty thousand rows over the past three months. It exists on every row of documents, on every row of events, on every row of the three domain tables you added in early June, and on every vector in the pgvector index. The application code has been filtering on it from day one. The discipline from Decision 5 (the q function that threads tenant context through every read and write) has been in muscle memory since you built it on a Friday night in May.

The migration that turns Stage 1 into Stage 2 is one UPDATE statement and one default value change. The existing rows get rewritten so that tenant_id carries the founder's real UUID, which you also generated this morning. The default on the column is dropped so that new rows have to supply a tenant ID at write time, which is the change that makes the sentinel stop being a valid value. There is no ALTER TABLE ADD COLUMN. There is no backfill across nine tables. There is no data access layer rewrite, because q was already passing tenant_id through every query and the only thing that changes is which UUID it passes.
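A minimal sketch of what that migration file could contain, written as a small Python generator so the per table statements stay identical. The founder UUID below is a made up placeholder, the function name is hypothetical, and the sketch assumes tenant_id is already NOT NULL, so that dropping the default really does force new rows to supply a value.

```python
import uuid

# Stage 1 sentinel from the text above.
SENTINEL = "00000000-0000-0000-0000-000000000001"


def sentinel_replacement_sql(tables, founder_tenant_id):
    """Build the Stage 2 statements: one UPDATE and one DROP DEFAULT per table."""
    # Parse the value as a UUID so nothing else can be interpolated into SQL.
    founder = str(uuid.UUID(founder_tenant_id))
    statements = []
    for table in tables:  # table names are trusted constants from your schema
        statements.append(
            f"UPDATE {table} SET tenant_id = '{founder}' "
            f"WHERE tenant_id = '{SENTINEL}';"
        )
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN tenant_id DROP DEFAULT;"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical founder UUID, for illustration only.
    for line in sentinel_replacement_sql(
        ["documents", "events"], "11111111-1111-7111-8111-111111111111"
    ):
        print(line)
```

The WHERE clause keeps the UPDATE idempotent: running it a second time against the copy of production touches no rows, which is a cheap way to confirm the first run finished.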

You write the migration. You run it against a copy of the production database first, which takes nineteen minutes because the events table has accumulated mass and the UPDATE has to rewrite every row. You roll it back. You run it again to be sure. The cost on production, when you run it for real later tonight, will be twenty minutes of locking on the events table during a window when nobody is using the agent. The shape of the work is the shape of an oil change, not the shape of an engine replacement.

My read is that this is the most important sentence in the post. The Stage 2 migration is one UPDATE statement because the Stage 1 column was already there, populated with a sentinel, filtered on by the application code. If the column was not there in Stage 1, the migration tonight is ALTER TABLE ADD COLUMN across however many domain tables you have, plus a backfill, plus a careful rewrite of every query in the data access layer, plus a freeze window in which the agent is paused because you cannot reason about partial rollouts of a column the application code conditionally filters on. That work is not nineteen minutes. It is, in my read, somewhere between three days and three weeks, depending on how many tables you have and how disciplined your data access layer was. Either way, it is not work that fits inside the window between Thursday afternoon and Friday morning.

The honest version of the claim, which I will hold to throughout this piece: pay the Stage 1 debt now or pay it during this transition. The transition is more expensive than the debt was. The bill is the same bill; it is the size of the bill that changes.

The second piece of work, which is the policy that has been waiting since May

The cursor moves to a new migration file. 2026/05/stage_2_rls_policies.sql. This is the file you have been thinking about since you read the Postgres row level security docs in late June, which was the week the second design partner email arrived in your inbox and started the conversation that landed on the sticky note now stuck to the bezel.

The policy itself is short. It reads roughly like this on every domain table.

CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY;

Three statements. The first is the policy expression, which reads as "the row's tenant ID has to match the session's current tenant." The second turns the policy on for normal connections. The third is the one that does the work that most people miss the first time they ship row level security, and it is the work that earns the bulk of the section you are reading now.

ENABLE ROW LEVEL SECURITY turns the policy system on for the table. By a separate Postgres default, table owners are exempt from policies, and in most Stage 1 setups the owner is the role that ran the migration that created the table, which is often the same role the application uses to connect. The migration enables the policy correctly. The pg_policies view shows the policy is there. The behavior at the connection where it matters is that the policy is not enforcing, because the application is the owner and owners bypass by default.

FORCE ROW LEVEL SECURITY removes the table owner exemption. The policy applies to the application connection regardless of whether the application happens to own the table. This is the qualifier I would not have known to add the first time I shipped a multi tenant Postgres schema, and it is the one I would name first if a peer founder asked me what to look for in their Stage 2 review. The Postgres docs are explicit about it, but the explicitness is in the section a reader new to row level security tends to skim. The Supabase team's practical write up of the pattern names the same trap from the angle of a managed Postgres setup, and the pattern is the same regardless of where the database lives.
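One way to check this rather than trust it is to read the two flags Postgres keeps on each table, relrowsecurity and relforcerowsecurity in pg_class, next to the table owner. The sketch below is an assumption about how you might wire that up: the catalog columns are real, while the function name, the 'public' schema filter, and the app_user role in the usage are placeholders. You would run the query with whatever driver you use and pass the rows in.

```python
# Postgres catalog query: RLS flags and owner for ordinary tables in one schema.
RLS_STATUS_SQL = """
SELECT c.relname,
       c.relrowsecurity,
       c.relforcerowsecurity,
       pg_get_userbyid(c.relowner) AS owner
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY c.relname;
"""


def unenforced_tables(rows, app_role):
    """Return (table, reason) for tables whose policies would not bind app_role.

    rows: tuples of (relname, relrowsecurity, relforcerowsecurity, owner).
    """
    findings = []
    for relname, rls_enabled, rls_forced, owner in rows:
        if not rls_enabled:
            findings.append((relname, "row level security is not enabled"))
        elif owner == app_role and not rls_forced:
            findings.append((relname, "owned by the application role, FORCE is off"))
    return findings


if __name__ == "__main__":
    sample = [
        ("documents", True, True, "app_user"),
        ("events", True, False, "app_user"),
    ]
    print(unenforced_tables(sample, "app_user"))
```

An empty list is the result you want before the partner logs in. This check covers only the owner exemption. Role attributes are a separate surface.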

You write the policy on documents. You write it on events. You write it on the three domain tables that have grown since June. You add the ENABLE and FORCE lines to each. You commit. The file is forty one lines for five tables. Most of the lines are blank or comments. The policy expression is the same on every table because the column name is the same on every table, which is the Stage 1 discipline doing its quiet work three months after you wrote it.

There is one Stage 1 decision in this piece of work that is doing real load. The application code has been threading tenant_id through the q function on every query since May. The policy turning on this afternoon does not change which rows the application sees, because the application has been filtering on tenant_id voluntarily the whole time. The policy is a safety net, not a behavior change. If the policy were turned on against an application that had not been filtering, you would discover at the policy turn on that some queries return zero rows when they used to return everything, and you would be debugging the application code at the same time you were trying to onboard the partner. That is the worst kind of work to be doing at five in the afternoon on a Thursday.

The Postgres RLS docs are the right reading for the next thirty minutes if any of this is new to you. The Supabase write up is the right second read. The third read is your own application code, which you grep for direct database access that bypasses the q function. The grep should return zero results. If it does not, the rows the grep finds are the rows that will leak the first time someone outside your team runs a support query.
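If plain grep is awkward for your codebase, the same sweep can be sketched in a few lines. Everything specific here is an assumption: that the application is Python, that direct database access looks like a call to .execute( or .executemany(, and that the q function lives in a file called db.py. Adjust the pattern and the allow list to whatever your data access layer actually looks like.

```python
import re
from pathlib import Path

# Assumed signature of "direct database access" in a Python codebase.
DIRECT_ACCESS = re.compile(r"\.(execute|executemany)\s*\(")


def find_direct_access(root, allowed=("db.py",)):
    """List (file, line number, source line) for calls outside the allowed files."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        if path.name in allowed:  # the module that owns q is allowed to execute
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DIRECT_ACCESS.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_name, lineno, source in find_direct_access("."):
        print(f"{file_name}:{lineno}: {source}")
```

A pattern match like this will produce false positives (a call on something that is not a database cursor) and can miss access hidden behind helpers, so treat the output as a reading list, not a proof.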

A short detour into a sibling scene, which is the failure mode that lives on the other side of this Thursday

Before the next piece of work, I want to put a sibling scene on the table the way the previous post did. The question of whether the policy you wrote a moment ago is actually enforcing is the most consequential one in the post, and the wrong answer to it is a failure mode I want you to be inoculated against.

Picture a different Thursday in a different apartment. Two co founders. They have shipped Stage 1 well. The tenant_id columns are everywhere, the application code threads them correctly, the events table has three months of organic history. The partner is logging in Friday. The two of them sit down on Thursday afternoon and do exactly what you did this afternoon, with the same migration files and the same policy expressions, and they run the same EXPLAIN against the same query and see the same plan that mentions the policy filter inline. The pg_policies view lists the policy on every table. They sign off the migration and grant the partner credentials.

Two weeks later, on a Tuesday afternoon, the partner is on a support call with one of the co founders. She runs an ad hoc query against her own data to show what she is seeing. The query returns rows that are not hers. They are rows from the founder's tenant, with the founder's customer names in them, in a window the partner has open while the founder watches.

The cause, when they figure it out later that evening, is that the application's database role had BYPASSRLS set as a Postgres role attribute from a Stage 1 debugging session months earlier. One of them had been chasing a slow query in March, had set BYPASSRLS on the application role temporarily so they could see the unfiltered rows for the debugging session, and had never removed it. The attribute was set at the role level, not at the policy level, which meant the policy itself was perfect and the connection was bypassing it before the policy got to run. The pg_policies query they ran on Thursday showed the policy. The pg_roles query they did not run on Thursday would have shown the bypass.

The lesson that lands here is the one I would write on the wall in the apartment if I were them. The policy file is silent about whether the grants and the role attributes are correct. The policy can be perfect and the enforcement surface can be open at the same time, and the only artifact that tells you so is a query against pg_roles that the policy file does not require you to run. My read is that most founders who ship row level security for the first time pay this tax once. Some of them pay it during a partner conversation that they remember for the rest of the year.
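The query the sibling team did not run is short. A sketch of it, with a small helper that compares the result against the roles you have decided may bypass: pg_roles and its rolsuper and rolbypassrls columns are real Postgres catalog names, while the helper name and the role names in the usage are placeholders.

```python
# Superusers and roles with BYPASSRLS both skip row level policies.
BYPASS_SQL = "SELECT rolname, rolsuper, rolbypassrls FROM pg_roles ORDER BY rolname;"


def roles_that_skip_policies(rows, expected_bypass):
    """Return roles that bypass policies but are not on the documented list.

    rows: tuples of (rolname, rolsuper, rolbypassrls) from BYPASS_SQL.
    """
    return [
        rolname
        for rolname, is_superuser, bypasses_rls in rows
        if (is_superuser or bypasses_rls) and rolname not in expected_bypass
    ]


if __name__ == "__main__":
    sample = [
        ("app_user", False, True),   # the leftover from a debugging session
        ("migrator", False, True),
        ("postgres", True, True),
    ]
    print(roles_that_skip_policies(sample, expected_bypass={"migrator", "postgres"}))
```

Anything this returns is a role the policy file says nothing about and the database trusts completely.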

Back to your apartment, which is now slightly darker because the lamp's bulb is the warm one and the sun has dropped another two degrees toward the window frame.

The third piece of work, which is the part where your relationship with the database changes

The cursor moves to a third migration file. 2026/05/stage_2_role_attributes.sql. This is the file that addresses the failure mode in the sibling scene directly, and it is the piece of work where, for the first time since you started building the system, your own database role becomes meaningfully different from the application's database role.

Up to this Thursday afternoon, the database has had three connections that mattered. The application connection, which the agent uses for every read and write. The migration connection, which CI uses when it applies schema changes. And your psql connection, which you use when you want to see what is happening. All three have been the same role, more or less, since the first weekend in May. The role had whatever attributes were convenient. Some weeks it had BYPASSRLS on, some weeks it did not. Nobody was watching, because nobody else was looking at the data.

Tomorrow morning someone else is looking. The file you are writing now is the one that changes the role topology to reflect that.

The application role loses BYPASSRLS. That is the first line in the migration, and it is the line that does the actual enforcement work that the policies from the previous section depend on. You write ALTER ROLE app_user NOBYPASSRLS; and you commit it. The role is now constrained to whatever the row level policy says it can see, on every table, on every connection. The change is small in characters. The change in posture is large.

The migration role keeps BYPASSRLS, because migrations have to be able to write across tenants without the policy machinery getting in the way. You name the role explicitly in the migration. You set it in the CI environment so that schema changes run under it and only under it. You document the role in the runbook. The acceptable use is named. The unacceptable use is the application role that handles HTTP traffic.

Your psql role, the one you use when you connect from your laptop, keeps BYPASSRLS. So does your co founder's. This is a judgment call I want to name explicitly, because the absolutist version of the discipline would say nobody bypasses anything ever. My read is that the founding team retaining their debugging surface at Stage 2 is the right call. Your co founder is still inside the founding team. The partner is the first operator outside it. The role hygiene at Stage 2 is about the application role and the partner facing connection, not about removing the founder's debugging access. The Stage 4 work, when the first regulated customer signs and the auditor asks about your debugging surface, will narrow this further. At Stage 2 the co founder retains the surface, with the bypass attribute documented in the onboarding notes so that whoever joins the team next can read what they are inheriting.

You write the three role attribute changes. You add a line to the runbook that names which roles have BYPASSRLS and why. You commit the migration. The migration is six lines long and it is, in my read, the migration that changes the most about how the system actually works, because it is the migration that decides who the database trusts.
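A sketch of keeping the migration and the runbook line generated from one table of decisions, so the two cannot drift apart. Only app_user comes from the text above. The names migrator and founder_psql are hypothetical stand ins for the migration role and the founders' psql roles.

```python
# role name -> (may bypass row level security, reason recorded in the runbook)
ROLE_POLICY = {
    "app_user": (False, "serves HTTP traffic, must be constrained by the policies"),
    "migrator": (True, "CI schema migrations write across tenants"),
    "founder_psql": (True, "founding team debugging surface, revisit at Stage 4"),
}


def _attribute(bypass):
    return "BYPASSRLS" if bypass else "NOBYPASSRLS"


def role_attribute_sql(policy):
    """One ALTER ROLE statement per role, stating the attribute explicitly."""
    return [
        f"ALTER ROLE {role} {_attribute(bypass)};"
        for role, (bypass, _reason) in policy.items()
    ]


def runbook_lines(policy):
    """The matching runbook entries: which roles bypass, and why."""
    return [
        f"{role}: {_attribute(bypass)} ({reason})"
        for role, (bypass, reason) in policy.items()
    ]


if __name__ == "__main__":
    print("\n".join(role_attribute_sql(ROLE_POLICY)))
    print("\n".join(runbook_lines(ROLE_POLICY)))
```

Stating NOBYPASSRLS and BYPASSRLS explicitly for every role, including the ones that do not change, means the migration file doubles as a record of intent.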

There is a related piece of work in the same file that earns about half the surface area of the role change, and it is the connection pooler configuration. The way the policies enforce is that the session variable app.current_tenant carries the tenant UUID for the current request. The application sets it on the connection at the start of every request. The policy reads it on every query. If the connection pooler returns a connection to the pool with app.current_tenant still set from a previous request, and the next request gets that connection without resetting the variable, the policy reads the stale UUID and authorizes the wrong tenant's data. This failure mode is silent. No exception fires. The query returns rows that look fine.
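One way to make the stale value impossible rather than merely unlikely is to scope the variable to a transaction. The sketch below assumes a DB-API style connection with psycopg style %s placeholders, and the helper name tenant_transaction is hypothetical. It is not the q function from Stage 1, only an illustration of the same idea. set_config with its third argument true is the function form of SET LOCAL, so the value ends when the transaction does.

```python
import uuid
from contextlib import contextmanager

# Third argument true: the setting lasts only until the transaction ends.
SET_TENANT_SQL = "SELECT set_config('app.current_tenant', %s, true)"


@contextmanager
def tenant_transaction(conn, tenant_id):
    """Run one request's queries in a transaction bound to a single tenant."""
    tenant = str(uuid.UUID(str(tenant_id)))  # reject anything that is not a UUID
    cur = conn.cursor()
    try:
        # DB-API drivers open a transaction implicitly on the first statement.
        cur.execute(SET_TENANT_SQL, (tenant,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

A request handler would wrap its work in `with tenant_transaction(conn, tenant_id) as cur:` and run every query through cur. If the connection is in autocommit mode this does not hold, because the setting would end with the statement that set it.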

The fix depends on how the pooler is configured. If you are on PgBouncer in session mode, each client connection holds a server connection for the duration of the session, and PgBouncer runs its server_reset_query (DISCARD ALL by default) when the client disconnects. Any connection your application keeps open across requests is still one session, so RESET ALL or DISCARD ALL between requests is the responsibility of the application code. Or you put the pooler in transaction mode and you wrap every request in a transaction that sets the variable at the start with SET LOCAL or set_config(..., true), so the value ends with the transaction. Or you use the newer pooler patterns where session variables are discarded between transactions by configuration. Whichever path you pick, you write down which path it is, and you write a thirty second test that verifies the variable does not bleed across connection reuse. The test opens a connection, sets app.current_tenant, runs a query, closes the connection, opens a second connection (which the pooler may return from the pool as the same physical server connection), and checks that current_setting('app.current_tenant', true) is null or an empty string. (Postgres can report an empty string rather than null for a custom setting that was set earlier on the same server connection.) If it still carries a UUID, the pooler is not enforcing the discard semantics and tenant context is bleeding silently.

The test is thirty seconds. The failure mode is the kind that takes you nine days to find later. I think this is the second piece of Thursday afternoon work that earns more sleep than its size suggests, and I think most teams that ship row level security on a tight schedule skip it. The connection role and the session variable are the two halves of the enforcement surface, and the policy file is silent about both.
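A sketch of that test, with an in memory stand in for the pooler so the logic can be read and run without a database. The pool interface (acquire, release, and an execute that returns a single value) is hypothetical, and FakePool and FakeServerConnection exist only to show both outcomes. Against a real pooler you would adapt tenant_context_bleeds to your driver and keep the two SQL statements.

```python
# Session scoped set (third argument false), which is what can bleed.
SET_SQL = "SELECT set_config('app.current_tenant', %s, false)"
READ_SQL = "SELECT current_setting('app.current_tenant', true)"


def tenant_context_bleeds(pool, probe_tenant):
    """Set the variable on one checkout, then look for it on the next one."""
    first = pool.acquire()
    first.execute(SET_SQL, (probe_tenant,))
    pool.release(first)
    second = pool.acquire()
    try:
        leftover = second.execute(READ_SQL)
    finally:
        pool.release(second)
    # Null and empty string both count as clean.
    return leftover not in (None, "")


class FakeServerConnection:
    """Stand in for one physical server connection and its session settings."""

    def __init__(self):
        self.settings = {}

    def execute(self, sql, params=()):
        if sql == SET_SQL:
            self.settings["app.current_tenant"] = params[0]
            return params[0]
        if sql == READ_SQL:
            return self.settings.get("app.current_tenant")
        raise ValueError(f"unexpected statement: {sql}")


class FakePool:
    """Stand in pooler with a single server connection that is always reused."""

    def __init__(self, reset_on_release):
        self.reset_on_release = reset_on_release
        self.idle = [FakeServerConnection()]

    def acquire(self):
        return self.idle.pop()

    def release(self, conn):
        if self.reset_on_release:
            # Mimics a reset: the setting reads back as an empty string.
            conn.settings = {key: "" for key in conn.settings}
        self.idle.append(conn)


if __name__ == "__main__":
    probe = "0193a8c2-7e44-7c40-9f1d-5c8b4e3d2a91"
    print(tenant_context_bleeds(FakePool(reset_on_release=False), probe))
    print(tenant_context_bleeds(FakePool(reset_on_release=True), probe))
```

With a pool of more than one server connection the second checkout may land on a different connection and pass by luck, so a real version would either pin the pool size to one for the test or repeat the probe enough times to cover every connection.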

The fourth piece of work, which is the small one that matters more than its size

The cursor moves back to the application code. There is a function called log_event that the agent calls every time it lands a side effect, and it has been writing to the events table since the first weekend in May. The signature takes an action string, a payload blob, and the tenant_id from the request context. The function generates a UUID for the event, writes the row, and returns. The events table has roughly eighty thousand rows of this organic history, and all of them carry the value 'agent' in the actor_kind column because that is the literal string you typed in the Stage 1 migration when you created the table.

What actor_kind does not yet have is real differentiation between the agent and the human writes. Some events are the agent taking an action on behalf of the founder. Some events are the founder taking an action directly through the admin panel your co founder built last weekend. Some events are the future ones the design partner will trigger tomorrow morning, which will themselves split between the agent acting on her behalf and her clicking buttons in her own session. Today, all of these get written with actor_kind = 'agent' because the log_event function does not know the difference.

The piece of work, in characters, is small. The function gets a second optional parameter, actor_kind, with a default of 'agent'. The callers that are human triggered pass 'user'. The callers that are agent triggered pass nothing and inherit the default. The migration is no schema change, because the column has been there since May. The application change is a one line addition to the function signature and a sweep through the call sites to set the value correctly where the caller is a human action.
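A sketch of the changed function. The real log_event writes to Postgres, so the write_row callable here is a hypothetical seam that lets the example run on its own, and returning the event ID is an assumption. The sketch uses uuid4 because it is in the Python standard library everywhere, while the series itself uses UUID v7.

```python
import json
import uuid


def log_event(write_row, action, payload, tenant_id, actor_kind="agent"):
    """Write one event row. Agent callers keep the default, human callers pass 'user'."""
    row = {
        "id": str(uuid.uuid4()),        # stand in for the UUID v7 the series uses
        "tenant_id": str(tenant_id),
        "action": action,
        "payload": json.dumps(payload),
        "actor_kind": actor_kind,       # still free text at Stage 2, no enum yet
    }
    write_row(row)
    return row["id"]


if __name__ == "__main__":
    events = []
    tenant = "0193a8c2-7e44-7c40-9f1d-5c8b4e3d2a91"
    log_event(events.append, "document.summarized", {"doc": 1}, tenant)
    log_event(events.append, "settings.updated", {}, tenant, actor_kind="user")
    print([event["actor_kind"] for event in events])
```

The default keeps every existing agent call site untouched, which is why the sweep only has to find the human triggered ones.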

The why is larger than the work. The Stage 4 audit story, when it comes, will rest on whatever the events table has accumulated organically over the months between launch and the auditor's first conversation. If the table has three months of actor_kind = 'agent' on every row, the auditor cannot reconstruct who did what. They can reconstruct that the system did things. They cannot tell which were the agent acting autonomously and which were the founder steering. The reconstructability claim, when you make it to the security committee in some quarter that has not arrived yet, has to land on three months of real values, not three months of nulls or three months of a constant default. The work this Thursday afternoon is what makes that reconstructability possible.

This piece does not light up an enforcement layer. There is no policy here. There is no role change. The actor_kind column is still free text, the events table is still a regular table, there is no hash chain, no append only constraint, no enum, none of the audit hardening that lights up at Stage 4 when the first regulated customer signs and the conversation about reconstructability begins. All of that work is later. Tonight the work is making sure that the column starts carrying meaningful values now, so that the Stage 4 work, when it lands, is additive on top of organic history instead of a backfill against logs.

You add the parameter to log_event. You sweep the call sites. The grep takes ten minutes. You commit. The function is now writing real values. The eighty thousand existing rows still carry 'agent' everywhere, which is correct enough for the events written before the partner existed (because nearly all of them were the agent), and the next eighty thousand rows will start carrying the differentiation. By the time the auditor conversation happens, however many months from now, the meaningful rows will outnumber the historical rows by an order of magnitude. The table will read as honest, because it will be.

A second detour, which is the failure that looks correct and is not

I owe you the second sibling scene before the thesis lands, because the first sibling scene was about the policy being correct while the role attributes leaked, and the symmetric failure is the policy being correct on every table and the relationships between the tables leaking instead.

Picture a solo founder in a different city. She ships Stage 2 carefully. She has read the Postgres docs end to end, she has written the policies on every domain table, she has run FORCE ROW LEVEL SECURITY on each one. She has revoked BYPASSRLS from her application role. She has configured her connection pooler with explicit discard semantics. She runs a cross tenant query test on every table before her partner logs in. Every test passes. She ships.

What she has missed is small. Her schema has a table called customer_data with a tenant_id column and a row level policy. It also has a table called customer_notes with a customer_id foreign key pointing to customer_data. The customer_notes table has its own tenant_id column and its own row level policy. Both policies are correct on their own tables. The partner runs a query against customer_notes after logging in, gets back her own notes, all of which carry her own tenant_id, all of which look right. She also gets back the customer_id values from those rows, which are UUIDs pointing into customer_data.

Some of those customer_id values, in a corner of the data the founder did not check, point at customers in the founder's tenant. The relationship was created earlier in the system's life when the application code had a path that joined a partner facing form to a founder owned customer record, for reasons that seemed good at the time and have since been forgotten. The row level policy on customer_notes is keyed on customer_notes.tenant_id, which is the partner's, so the row passes the policy. The foreign key value in that row points at a customer in a tenant the partner cannot see directly, but the value itself is now in the partner's hands.

The IDs are UUID v7 from Decision 3 of Stage 1, so the leak is soft. The partner cannot derive the founder's customer count from sequential IDs, because there are no sequential IDs. Apart from the creation timestamp that UUID v7 embeds in its leading bits, the IDs are opaque. But the partner now has a list of customer UUIDs that exist in a tenant that is not hers, and a query through any application path that joins customer_notes.customer_id to customer_data.id would unmask them if the policy on customer_data happens to be loose, or if a future code path is written against a connection role that bypasses, or if a debugging session three months from now exposes the join in a place the founder did not expect.

The lesson, when she figures it out a week after the partner logs in and surfaces it to her own runbook before any actual unmasking happens, is the lesson the second failure mode is here to teach. Row level security is scoped per table. A foreign key relationship is its own access path, and the per table policy does not see across it. The pre flight has to walk the foreign key graph, not the table list alone. Every foreign key in the schema needs to be confirmed against the referenced table's policy, and any orphan reference (a foreign key value pointing into a tenant the policy on the referencing table does not constrain) is a leak in waiting.
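The walk itself can be sketched as two queries. The first lists foreign keys from the Postgres catalog (pg_constraint is real; this version reads only the first column of each key, so composite keys need more care). The second, built per edge, counts references whose two ends sit in different tenants. The function names are hypothetical, the example edges come from the tables named in this scene, and the count has to run under a role that bypasses row level security, otherwise the policy hides the very rows you are looking for.

```python
# Postgres catalog query: single column foreign keys as (child, column, parent, key).
FOREIGN_KEYS_SQL = """
SELECT con.conrelid::regclass::text AS child,
       a.attname AS fk_column,
       con.confrelid::regclass::text AS parent,
       af.attname AS parent_key
FROM pg_constraint AS con
JOIN pg_attribute AS a
  ON a.attrelid = con.conrelid AND a.attnum = con.conkey[1]
JOIN pg_attribute AS af
  ON af.attrelid = con.confrelid AND af.attnum = con.confkey[1]
WHERE con.contype = 'f';
"""


def cross_tenant_reference_sql(child, fk_column, parent, parent_key="id"):
    """Count child rows whose reference lands in a different tenant."""
    return (
        f"SELECT count(*) FROM {child} AS c "
        f"JOIN {parent} AS p ON p.{parent_key} = c.{fk_column} "
        f"WHERE p.tenant_id <> c.tenant_id"
    )


# Edges from the schema in this scene.
EXAMPLE_EDGES = [
    ("customer_notes", "customer_id", "customer_data"),
    ("customer_attachments", "note_id", "customer_notes"),
]


def preflight_queries(edges):
    """One labeled check per foreign key edge. Every count should come back zero."""
    return {
        f"{child}.{column} -> {parent}": cross_tenant_reference_sql(child, column, parent)
        for child, column, parent in edges
    }


if __name__ == "__main__":
    for label, sql in preflight_queries(EXAMPLE_EDGES).items():
        print(label)
        print("  " + sql)
```

Any count above zero is an orphan reference of the kind the solo founder found a week too late.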

The fix is small once she finds it. The application code that created the cross tenant relationship is removed. The customer_id values that pointed at the founder's tenant are nulled out, or migrated to the partner's tenant if they belonged there in the first place. A constraint is added that says foreign key values must reference rows in the same tenant. The pre flight checklist gets a new step that walks the foreign key graph and confirms the relationship.
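One common shape for that constraint is a composite foreign key: the child references (tenant_id, id) on the parent instead of id alone, so a reference into another tenant cannot be written at all. The sketch below demonstrates the idea with SQLite from the Python standard library so it runs anywhere. That substitution is mine, not the founder's setup. In Postgres the same FOREIGN KEY clause applies, with uuid columns and a matching UNIQUE constraint on the parent.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer_data (
    id        TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    name      TEXT,
    UNIQUE (tenant_id, id)              -- target for the composite reference
);
CREATE TABLE customer_notes (
    id          TEXT PRIMARY KEY,
    tenant_id   TEXT NOT NULL,
    customer_id TEXT,
    FOREIGN KEY (tenant_id, customer_id)
        REFERENCES customer_data (tenant_id, id)
);
"""


def open_demo_db():
    """In memory database with foreign key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def try_insert_note(conn, note_id, tenant_id, customer_id):
    """Return True if the note was accepted, False if the constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO customer_notes (id, tenant_id, customer_id) VALUES (?, ?, ?)",
            (note_id, tenant_id, customer_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = open_demo_db()
    conn.execute("INSERT INTO customer_data VALUES ('c1', 'founder', 'Acme')")
    conn.execute("INSERT INTO customer_data VALUES ('c2', 'partner', 'Birch')")
    print(try_insert_note(conn, "n1", "partner", "c1"))  # cross tenant reference
    print(try_insert_note(conn, "n2", "partner", "c2"))  # same tenant reference
```

A note whose customer_id is null is still accepted, which is what lets the cleanup null out the orphaned values before the constraint goes on.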

Back to your apartment, where you have now added the foreign key walk to your own pre flight, even though you had not been planning to, because the sibling scene above is the kind of failure mode that earns its place in the checklist by being silent at the moment of partner login and surfacing weeks later.

Figure: The foreign key walk, where the per table policy does not see across the reference. Three tables (customer_data, customer_notes, customer_attachments) each have row level security on with a policy on tenant_id. The customer_notes.customer_id reference into customer_data is marked as the FK leak path, and the customer_attachments.note_id reference into customer_notes is marked as a same tenant FK. RLS is scoped per table, a foreign key is a second access path, and the pre flight has to walk the foreign key graph, not the table list.

The thesis, which lands in one sentence and earns the eight that follow

You step back from the migration files. There are three of them now, sitting in 2026/05/ next to the Stage 1 migrations from May. You make a coffee, which you intend to drink this time instead of microwaving twice. You look at the three files as a set, and you write down on the back of an envelope, in pencil, the shape of the work that landed this afternoon.

One migration that replaces the sentinel tenant_id with real per tenant values. Three turn ons. Row level security, with ENABLE and FORCE on every table. Connection role hygiene, where the application role loses BYPASSRLS and the migration role keeps it. The actor_kind discipline in the events table, where the application code starts writing real values instead of the constant default.

That is the entire surface area of Stage 2 for a team that shipped Stage 1 well. The schema does not change. The data access layer does not change. The application queries do not change. The migration runs in twenty minutes. The role attribute changes are six lines. The actor_kind change is a function parameter and a grep sweep. The whole thing is half a day of work, maybe a full day if you have never written a row level policy before and want to think carefully about the connection roles. By the time the partner logs in tomorrow morning at nine, the system has flipped from "we trust ourselves" to "the database enforces who sees what," and the partner's first query reads only her tenant's rows because the policy is reading a column that has been waiting since May for a second value to enforce against.

If Stage 1 was rushed, the bill arrives now, and the bill is the work this section pretends does not exist. The column add across every domain table. The backfill that assigns existing rows to the founder. The data access layer rewrite that threads tenant_id through every query. The freeze window during which the agent has to be paused, because partial rollouts of a conditionally filtered column are not reasonable. The events table that was never written to, which means three months of audit history have to be reconstructed from log files or accepted as gone. The conversation with the partner that begins with "we need to push the launch by ten days," which is the kind of email that costs trust in a way that does not show up on the migration diff.

Stage 2 is one migration plus three turn ons. The schema is unchanged. The bill that arrives is the bill for whatever Stage 1 did not buy.

Figure: Stage 1 to Stage 2, the columns did not change. Paired panels of the customer_data table show the same columns on both sides (id, tenant_id, name, created_at). At Stage 1 tenant_id defaults to the sentinel and roughly eighty thousand rows carry it. At Stage 2 the default is dropped, the column is NOT NULL, and rows carry real tenant UUIDs. Three badges carry the posture change: the row level policy goes from off (application code filters voluntarily) to forced, the application role goes from implicit BYPASSRLS to NOBYPASSRLS, and events.actor_kind goes from a constant 'agent' to real 'agent' or 'user' values per write.

What does not light up at Stage 2, and why each absence is a decision

There is a list, and it is half the point of the post, because the discipline of refusing to do things that are not on the small set is half of what makes Stage 2 finishable in the time the partner's NDA timeline allows.

Role based access control stays off. The partner has two humans in her account tomorrow morning, herself and her co founder. Both of them have full access to the partner's tenant. There is no admin role, no member role, no viewer role, no permissions framework. The trigger that promotes RBAC from a deferral into a real piece of work is the partner growing her account to a third human who needs a different slice of the same data. Until that happens, you have nothing to model, and the roles table you might build today would model a permission structure your first partner has not asked for. The Brooklyn team in the previous post is the cautionary scene for this one. Wait for the customer.

Agent secret separation stays off. The agent still reads its API keys from environment variables. The keys are the same keys they were on the Stage 1 Friday. There is no dereference layer, no separation between the secrets the agent needs to call third party APIs and the operator surface that holds them. The trigger that promotes secret separation is the first regulated customer signing and asking how API keys and PII flow through the agent, which is Stage 4 work. Until then, the agent's secret surface is the same as the founder's secret surface, because the agent is operated by the founder. The partner does not see the secrets. The partner does not need to see them. The deferral holds.

Full audit hardening stays off. The events table is still a regular Postgres table. There is no hash chain, no append only constraint, no actor_kind enum, no non repudiable signing of writes. The events written between now and Stage 4 will give the auditor real history to work against, but the table itself is the same table from Stage 1. The trigger is the first auditor asking for reconstructability, which is Stage 4, and the work then is additive on the column shape and the integrity constraints, not on the rows themselves.

Idempotency key enforcement stays off. The tool signatures from Decision 6 of Stage 1 hold, which means the agent generates a request ID for every tool call and the tool implementation accepts the ID and ignores it. The enforcement layer that reads the ID and rejects duplicates inside a window is Stage 3 work, lit up by a paying customer for whom a duplicate refund is real money. The partner is not a paying customer tomorrow. The cost of a duplicate at Stage 2 is a debugging story, not a customer story.

The outbox stays off. The agent's tool call and the database write are still close enough that a single transaction is enough. There is no dual write pressure between Postgres and an external system that has to be reconciled. The trigger is Stage 3 paying customers, where dropped side effects start having external consequences. Until then, the transactional boundary is enough.

Each absence has a trigger. The trigger is the next operator type, the next consequence shape, the next customer ask. None of the triggers is "the team had bandwidth this week." A system without these pieces is early, not unfinished, and building any of them before its trigger fires is premature. The deferrals are the decisions, the same as they were in Stage 1. The list of decisions is the list of things that are not on the small set.

Figure: What lights up at Stage 2, and what holds. The Stage 1 deferral heatmap, now with the Stage 2 cells filled in solid purple for the items that turn on this Thursday afternoon. Every other row holds at half intensity, still deferred. Legend: dormant, lights up, required at this stage. Columns: 02 Co-builders (first NDA'd operator), 03 Paying (contracts signed), 04 Regulated (SOC2 / HIPAA ask lands), 05 Multi-agent (concurrent at consequence), 05+ Almost never (documented requirement). Rows: row level security policies; roles table, RBAC; idempotency key enforcement; outbox pattern; materialized views, replicas, cache; agent secret separation; actor_kind discipline, real values; GDPR tombstones, append only; multi region, failover, sharding; event sourcing (if ever). Two cells are required at Stage 2, row level security policies and actor_kind discipline; everything else waits for its trigger. The actor_kind discipline lands now as preparation, so Stage 4 audit work is additive against organic write history.

Thursday morning, when the work earns the night's sleep

It is now Thursday morning. The light is the gentler one, the coffee is hot for once, and there are roughly thirty minutes between the moment you sit down at the desk and the moment the first meeting on your calendar will want your attention. The migration files were committed and merged last night. The role attribute changes were applied to staging at eleven and to production at midnight, during the quiet window when nobody was using the agent. The actor_kind sweep landed in the same deploy. By the time the partner logs in tomorrow morning at nine, the system has been running with the Stage 2 posture for slightly less than thirty six hours.

What runs this morning is the pre flight. Eight checks, in order. Each is small. The point of the rhythm is not the checklist as an artifact. It is the half hour between the second cup of coffee and the first meeting, the half hour you trade for the night of sleep before the partner logs in.

First, the tenant_id non null sweep. You run one SELECT COUNT(*) WHERE tenant_id IS NULL against every domain table. The expected result is zero everywhere. If any table comes back with a positive count, the sentinel was not consistently written during Stage 1 and the migration last night could not have backfilled it correctly. You would have seen the migration fail if this were the case, but the sweep is the cheap confirmation. Three minutes, six tables. Zero everywhere. You move on.
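As a rough sketch of that sweep, assuming the domain tables live in a schema called app (a hypothetical name) and using only the three table names this post mentions, the per table counts can be stacked into one result. I would run it as a role that bypasses row level security, such as the migration role: under the application role, a policy of the shape tenant_id = current_setting(...) would hide any row whose tenant_id is NULL, and the sweep would report zero no matter what.

```sql
-- Assumption: the schema name "app" is illustrative. Extend the list to all six domain tables.
SELECT 'documents' AS table_name, count(*) AS null_tenant_rows
  FROM app.documents WHERE tenant_id IS NULL
UNION ALL
SELECT 'events', count(*)
  FROM app.events WHERE tenant_id IS NULL
UNION ALL
SELECT 'customer_data', count(*)
  FROM app.customer_data WHERE tenant_id IS NULL;
```

Every row in the result should show zero in the second column.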

Second, the sentinel sweep. One SELECT DISTINCT tenant_id FROM <table> against every domain table. The expected result is exactly two distinct values on tables that have already received writes from the partner's staging session (the founder's UUID and the partner's UUID), and exactly one value (the founder's UUID) on tables that have not yet. If any table returns three or more distinct values, a third tenant existed without anyone noticing, and the Stage 2 transition happened silently a while ago. Three minutes, same six tables. The values look right. The partner's UUID matches the one on the sticky note.
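One way to run the sentinel sweep in a single pass, again assuming a hypothetical app schema and the three table names the post mentions, is to group by tenant_id so each distinct value arrives with a row count. This is a sketch of the idea, not the exact query from the runbook, and it also needs a role that bypasses row level security to see both tenants at once.

```sql
-- Assumption: schema "app" is illustrative. Add the remaining domain tables in the same shape.
SELECT 'documents' AS table_name, tenant_id, count(*) AS row_count
  FROM app.documents GROUP BY tenant_id
UNION ALL
SELECT 'events', tenant_id, count(*)
  FROM app.events GROUP BY tenant_id
UNION ALL
SELECT 'customer_data', tenant_id, count(*)
  FROM app.customer_data GROUP BY tenant_id
ORDER BY table_name, tenant_id;
```

A table with one or two rows in this output is healthy. A table with three or more is the silent third tenant.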

Third, the RLS active sweep. One query against pg_class joined to pg_namespace, filtering to your application schema, checking relrowsecurity and relforcerowsecurity on every domain table. The expected result is both columns true everywhere. The query is one line, the result is a six row table that you eyeball in two seconds. Both columns true on every row.
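A minimal version of that catalog query might look like the following, assuming the application schema is named app (a placeholder). The forced flag matters because a table owner is otherwise exempt from the table's own policies.

```sql
-- relrowsecurity:      row level security is enabled on the table.
-- relforcerowsecurity: the policy also applies to the table owner.
SELECT c.relname,
       c.relrowsecurity,
       c.relforcerowsecurity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'app'   -- assumption: illustrative schema name
  AND c.relkind = 'r'     -- ordinary tables only
ORDER BY c.relname;
```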

Fourth, the policy existence check. One query against pg_policies filtered to your schema. The expected result is a row per domain table. The query takes one minute to write and one second to run. The result has six rows. You read them out loud, because reading the policy expressions out loud is the cheap way to catch a typo that an eye scan misses.
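For the read aloud step, a sketch along these lines pulls the expressions into view, assuming the same hypothetical app schema. The qual column is the USING expression and with_check is the WITH CHECK expression, so a typo in either shows up in the text you are reading.

```sql
SELECT tablename,
       policyname,
       cmd,         -- ALL, SELECT, INSERT, UPDATE, or DELETE
       roles,
       qual,        -- USING expression: which existing rows are visible
       with_check   -- WITH CHECK expression: which new rows are allowed
FROM pg_policies
WHERE schemaname = 'app'   -- assumption: illustrative schema name
ORDER BY tablename, policyname;
```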

Fifth, the BYPASSRLS audit. The most important pre flight step, in my read, because it is the step that addresses the first sibling scene's failure mode directly. The query is SELECT rolname, rolbypassrls FROM pg_roles WHERE rolname IN ('app_user', 'migrator', 'leo', 'partner_session');. The expected result is app_user returning false, migrator returning true (acknowledged), your own role returning true (acknowledged), and the new partner_session role returning false. You read the four rows. The application role returns false. The migration role returns true. Your role returns true. The partner session role returns false. The configuration matches the runbook line you added last night. You move on.
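If you would rather have a check that is silent when things are right, one option is to invert the audit so it returns only violations. The sketch below uses the role names from the query above and adds rolsuper, since superusers bypass row level security whether or not the BYPASSRLS attribute is set. Treat the role list as an assumption about which roles face traffic.

```sql
-- Expect zero rows. Any row returned is a traffic facing role that can skip the policy.
SELECT rolname, rolsuper, rolbypassrls
FROM pg_roles
WHERE rolname IN ('app_user', 'partner_session')
  AND (rolsuper OR rolbypassrls);
```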

Sixth, the cross tenant query test. This is the test that, if I were running this, I would also wire into CI as a permanent regression check on every PR after this Thursday, because the failure mode is the kind that creeps back in through application code paths a year from now. The query, in shape, is the following.

SET app.current_tenant = '<tenant-uuid>';
SELECT count(*) FROM <table>
  WHERE tenant_id != current_setting('app.current_tenant')::uuid;

You run it against every domain table. The expected result is zero on every table. You set the variable to the partner's UUID, run the count against documents, the count against events, the count against the three domain tables. Zero everywhere. You then set the variable to the founder's UUID and run the same counts. Zero everywhere. The policy is enforcing in both directions. Eleven minutes, the longest step in the pre flight, because you run it twice (once per tenant) and you check the result carefully both times.

Seventh, the foreign key walk. The step the second sibling scene exists to motivate. You query the information schema for every foreign key in the application schema. The result is a small table, six or seven rows in your case, each row naming a referencing column and the referenced table. For each foreign key, you confirm two things. The first is that the referenced table has row level security active on it. The second is that the referencing column is constrained to reference rows in the same tenant. The first is a query against pg_class for each referenced table. The second is more involved, because it is a check that the foreign key values do not orphan across tenants, and the cheap version of the check is a query that joins the referencing column to the referenced table's tenant_id and confirms equality. You run the checks. Three rows pass cleanly. Three rows would also pass cleanly if the previous post's policy were not also doing some of this work. Six minutes.
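Here is one way the walk could be sketched. It reads pg_constraint rather than the information schema, which is a substitution on my part because the catalog makes it easy to pull the referenced table's RLS flags in the same query. The schema name app and the documents.customer_id column are hypothetical, chosen only to illustrate the shape of the orphan check.

```sql
-- Part 1: every foreign key in the schema, with the RLS flags of the table it points at.
SELECT con.conname             AS constraint_name,
       con.conrelid::regclass  AS referencing_table,
       con.confrelid::regclass AS referenced_table,
       ref.relrowsecurity      AS referenced_rls_enabled,
       ref.relforcerowsecurity AS referenced_rls_forced
FROM pg_constraint con
JOIN pg_class ref   ON ref.oid = con.confrelid
JOIN pg_namespace n ON n.oid = ref.relnamespace
WHERE con.contype = 'f'
  AND n.nspname = 'app'   -- assumption: illustrative schema name
ORDER BY con.conrelid::regclass::text, con.conname;

-- Part 2: for one hypothetical foreign key, documents.customer_id -> customer_data.id,
-- count references whose two ends belong to different tenants. Expect zero.
SELECT count(*) AS cross_tenant_refs
FROM app.documents d
JOIN app.customer_data c ON c.id = d.customer_id
WHERE d.tenant_id <> c.tenant_id;
```

Part 2 has to run under a role that bypasses row level security. Under the application role the policy hides the other tenant's rows, and the join would report zero even if an orphan existed.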

Eighth, the connection pooler smoke test. Thirty seconds, the smallest step. You open a connection from a separate shell. You set app.current_tenant to an arbitrary value. You run a query. You close the connection. You open a second connection, which the pooler may or may not return as the same physical server connection. You run SELECT current_setting('app.current_tenant', true);. The expected result is null. If the pooler returned the same server connection without discarding the GUC, you would see the previous value. You see null. The pooler is enforcing the discard semantics correctly. You move on.
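The two halves of the smoke test, written out as a sketch. The UUID is an arbitrary, made up value. The second query treats NULL and the empty string alike: my understanding is that a custom setting that was set and later reset on the same server connection can read back as an empty string rather than NULL, and either one means no tenant context carried over.

```sql
-- Connection A, opened through the pooler.
SET app.current_tenant = 'cccccccc-cccc-4ccc-8ccc-cccccccccccc';  -- arbitrary, hypothetical value
SELECT 1;
-- Close connection A, then open connection B through the pooler.

-- Connection B. True means no tenant context leaked from connection A.
SELECT NULLIF(current_setting('app.current_tenant', true), '') IS NULL
       AS tenant_context_is_clean;
```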

The whole rhythm is thirty one minutes by my measurement. The first cup of coffee covered steps one through three. The second cup covered steps four through six. Steps seven and eight ran while you were eating the second half of a croissant that was actually breakfast and not lunch, for once. The first meeting on your calendar starts in three minutes. The pre flight passed. The partner logs in tomorrow morning at nine.

Figure: Thursday morning, 30 minutes. Eight pre flight check cards arranged in two rows of four above a thirty minute timeline. Each card carries a numeral, the check name, the expected result, and a thin line connecting it to its minute mark. Eight checks across two cups of coffee. Pass on all eight, ship at noon.

01 (0:02) tenant_id non null sweep. Expects zero on every table.
02 (0:05) sentinel sweep, DISTINCT tenant_id. Expects one or two UUIDs, no more.
03 (0:09) RLS active sweep, pg_class join. Expects enabled and forced, both true.
04 (0:13) policy existence, pg_policies. Expects one row per table.
05 (0:17) BYPASSRLS audit, pg_roles. Expects app false, named roles true.
06 (0:21) cross tenant query, both directions. Expects zero rows on every table.
07 (0:25) foreign key walk, graph not table list. Expects no orphan cross tenant FK.
08 (0:28) pooler smoke, GUC discard. Expects the setting to return null.

Steps 1 to 3 cover the first cup. Steps 4 to 6 cover the second. Steps 7 and 8 run with the croissant.

A note for regulated readers

If you are building for regulated customers from day one (HIPAA, PCI, GLBA, SOC 2 audit on the horizon), Stage 2 is still upstream of your real ask, and most of this post sits beneath the floor of what your customer's first contract will require. The work in this post is the row level isolation and the connection hygiene that the AC family of NIST 800-53 names as the first tier of access control. The audit hardening, the secret separation, and the role differentiation work that the rest of the AC family and the AU family require are Stage 4 work, additive on top of the foundation this Thursday afternoon set. Nothing in this post is legal or compliance advice, and the regulatory reference is for orientation, not interpretation.

Friday morning, with the sticky note coming down

It is now Friday morning. The lamp is off because the autumn sun has come up enough to read by. The agent has been running with the Stage 2 posture for about a day and a half. The events table now carries about eighty thousand rows of organic Stage 1 history and the first three rows of Stage 2 history, each one carrying the founder's UUID and actor_kind = 'agent'. The cross tenant query test you wrote into CI yesterday afternoon ran on the deployment that went out at six this morning and passed cleanly on every table.

At nine oh four, the partner logs in. Her browser hits the application. The application sets app.current_tenant to the UUID that has been on the sticky note for two days. Her first query runs against documents, returns zero rows (because she has not yet uploaded any), and the policy reads the column that has been waiting since May for a second value to enforce against. The policy returns zero rows, which is the right zero rows.

You stand up from the desk for the first time this morning. You take the sticky note off the bezel of the second monitor. You fold it once. You open the top drawer of the desk and you put the sticky note inside, on top of the printed copy of the NDA and the receipt for the birch slab. You close the drawer. The partner's UUID is now in two places. The database, where it is enforcing. The drawer, where it is the artifact of a transition that happened.

The shape of the work this Thursday and Friday was the shape of the work Stage 1 made cheap. One migration that replaced a default value. Three policy turn ons that activated layers the Stage 1 discipline had been writing toward since May. Eight checks that ran in about thirty minutes. The schema did not change. The application code did not change. This week's migration files are the artifact of a transition that the Stage 1 files in 2026/04/ and 2026/05/ made possible. The Stage 2 prize is that the partner's first query reads only her rows, and the founder's first query reads only the founder's rows, and the database is the thing enforcing the difference instead of the application code being the only line of defense.

The frame the whole post has been building toward lands here. If Stage 1 was done well, Stage 2 is half a day. If Stage 1 was rushed, the bill arrives during this transition, and the transition is more expensive than the debt was. The Stage 1 decisions were the cheap insurance. The Stage 2 work is the moment the insurance pays out.

The next architect of this system, whoever that turns out to be, will open the migration files and the shape will read as deliberate, because it was. The role attribute changes will read as decisions, because they were. The cross tenant query test will read as the regression check that catches a class of failure the human eye cannot see, because it does.

If you are at the Stage 1 to Stage 2 boundary and any of this resonates, I would welcome a conversation. The contact form at the top of the site goes to my inbox.

Up next: Stage 3, when paying customers change the math on idempotency and roles.

Frequently Asked Questions

I shipped Stage 1 without `tenant_id` columns. How big is the Stage 2 migration?

Directionally, it is a multi day migration with a freeze window, and the conversation you have with your design partner about the timing is the part you have not budgeted for. The column add itself is cheap. The backfill across however many tables you have is bounded. The data access layer rewrite is the largest single piece, because every read path and every write path has to learn to thread the tenant context through, and the queries that you did not write yourself (the ones from the early weekends when the schema was small) have to be found by grep and edited one at a time. My honest answer is that this is longer than a design partner's NDA timeline can absorb in one sitting, and the right move is to surface the timing to the partner before you commit to a launch date that pretends the work is smaller than it is.

Do I really need RLS if my application code already filters by tenant?

Yes. The application layer filtering is the discipline; the row level policy is the enforcement. The two failure modes that make the policy earn its keep are, first, a developer adding a new query path six months from now that forgets the filter, and second, a connection getting reused with stale session context that the application did not reset. Both failures are silent without the policy. The application filter is the first line of defense; the policy is the line the database itself enforces, on every connection, on every query, regardless of what the application code did or did not do. The Postgres RLS docs (https://www.postgresql.org/docs/current/ddl-rowsecurity.html) are the canonical reading on the mechanics.
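To make the first failure mode concrete, here is a minimal sketch of what the enforcement could look like on one table. The schema name app and the policy name tenant_isolation are assumptions for illustration, and the expression is inferred from the cross tenant test rather than copied from the real migration.

```sql
ALTER TABLE app.documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE app.documents FORCE ROW LEVEL SECURITY;   -- the table owner is subject to the policy too

-- Hypothetical policy name. USING filters reads; WITH CHECK guards writes.
CREATE POLICY tenant_isolation ON app.documents
  USING      (tenant_id = current_setting('app.current_tenant')::uuid)
  WITH CHECK (tenant_id = current_setting('app.current_tenant')::uuid);

-- A query path that forgot its WHERE tenant_id = ... filter.
-- Under a non bypass role it still counts only the current tenant's rows.
SELECT count(*) FROM app.documents;
```

Written this way, current_setting raises an error when app.current_tenant was never set, so a connection with no tenant context fails loudly instead of returning someone else's rows.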

What about `BYPASSRLS`, when is it acceptable?

Migration tooling needs it, because schema changes have to write across tenants. The founder's interactive psql role can keep it at Stage 2, because the founder is still inside the founding team and retains a debugging surface. Any backfill script that writes across tenants needs it. The role that handles HTTP traffic from the application, the one your partner's session is going to use through your application server, never has it. The acceptable uses are explicit, named in the runbook, and documented in onboarding notes. The unacceptable use is the production application role. The role table is where the audit lives, and the migration that drops `BYPASSRLS` from the application role is the one that does the actual enforcement work the policy depends on.
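A sketch of what that role migration might contain, using the role names from the pre flight audit. The comment text is my own invention, offered as one way to keep the acknowledgement next to the role itself. Changing the BYPASSRLS attribute generally requires elevated privileges, so assume this runs as a superuser.

```sql
-- Traffic facing roles: never allowed to skip the policy.
ALTER ROLE app_user        NOBYPASSRLS;
ALTER ROLE partner_session NOBYPASSRLS;

-- Acknowledged exception, documented where the next engineer will see it.
COMMENT ON ROLE migrator IS
  'BYPASSRLS acknowledged: schema migrations and cross tenant backfills only.';
```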

Can my co founder still bypass tenant isolation?

Yes, through the founder psql role, and that is fine at Stage 2. The co founder is still inside the founding team. The partner is the first operator outside it. The role hygiene at Stage 2 is about the application role and the partner facing connection, not about removing the founder's debugging access. The Stage 4 work, when the first regulated customer signs and the auditor asks about your debugging surface, will narrow this further. At Stage 2 the co founder retains the surface, with the bypass attribute documented in the onboarding notes so that whoever joins the team next can read what they are inheriting.

Do I need a tenants table at Stage 2?

A one row tenants table is sufficient for the partner. A two row table including the founder's tenant is the honest minimum. The schema is `(id uuid, name text, created_at timestamptz)`. No billing fields, no roles, no settings, no plan column. The Stage 3 trigger (paying customer signs, billing fields earn their keep) is what adds the columns. At Stage 2 the table exists so the partner's UUID has a name to attach to, which means the audit log can render the tenant name a year from now without joining against a stub.
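As a sketch, the two row version could be as small as this. The column list is the one named above; the primary key, the NOT NULL constraints, the default on created_at, the schema name app, and both UUIDs and tenant names are assumptions added for illustration.

```sql
CREATE TABLE app.tenants (
  id         uuid PRIMARY KEY,
  name       text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO app.tenants (id, name) VALUES
  ('aaaaaaaa-aaaa-4aaa-8aaa-aaaaaaaaaaaa', 'Founding team'),   -- hypothetical founder tenant UUID
  ('bbbbbbbb-bbbb-4bbb-8bbb-bbbbbbbbbbbb', 'Design partner');  -- hypothetical; use the real partner UUID
```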

How do I test that RLS actually works?

The cross tenant query test from the pre flight, run as a permanent CI check after Stage 2. Set `app.current_tenant` to the partner's UUID. Run `SELECT count(*) FROM <every domain table> WHERE tenant_id != current_setting('app.current_tenant')::uuid;`. Zero on every table is the pass condition. Set the variable to the founder's UUID. Run the same query. Zero on every table is the pass condition. Wire it into CI. Make the test fail the build if any table returns nonzero. The regression you are guarding against is the one that creeps back in through a new table that someone adds in six months without enabling row level security on it, which the policy file is silent about and the CI check is loud about.
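One way to turn that into a check that fails on its own is a DO block that finds every table with a tenant_id column and raises as soon as any of them shows another tenant's rows. This is a sketch under assumptions: the schema name app and the tenant UUID are placeholders, the session is allowed to SET ROLE to app_user, and CI runs it once per tenant. Discovering tables from the catalog, rather than listing them by hand, is what lets it catch the table someone adds in six months.

```sql
-- Run as the application role so the policy applies (bypass roles see everything).
SET ROLE app_user;
SELECT set_config('app.current_tenant',
                  'bbbbbbbb-bbbb-4bbb-8bbb-bbbbbbbbbbbb',  -- hypothetical tenant UUID
                  false);

DO $$
DECLARE
  tbl    regclass;
  leaked bigint;
BEGIN
  -- Every ordinary table in the schema that carries a tenant_id column.
  FOR tbl IN
    SELECT c.oid::regclass
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    JOIN pg_attribute a ON a.attrelid = c.oid
    WHERE n.nspname = 'app'          -- assumption: illustrative schema name
      AND c.relkind = 'r'
      AND a.attname = 'tenant_id'
      AND NOT a.attisdropped
  LOOP
    EXECUTE format(
      'SELECT count(*) FROM %s WHERE tenant_id <> current_setting(''app.current_tenant'')::uuid',
      tbl
    ) INTO leaked;

    IF leaked > 0 THEN
      RAISE EXCEPTION 'tenant isolation leak: % rows from other tenants visible in %', leaked, tbl;
    END IF;
  END LOOP;
END
$$;

RESET ROLE;
```

A raised exception gives the CI step a failing exit status when the script runs through psql with ON_ERROR_STOP enabled.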

What is the right tenant identifier shape, UUID, slug, or both?

UUID is the primary identifier, and at Stage 2 the post stops there. The slug is a Stage 3 product surface question that the URL layer will need when the partner's account has a custom subdomain or a friendly URL or anything else that has to render in a browser address bar. If you must answer the question now, put the slug in the tenants table as a nullable column and leave it null until product surfaces ask for it. The UUID is the identity. The slug is a presentation layer concern that has not earned its surface area yet.

Code Atelier · NYC
