Shipping one feature across six repos without losing my mind

A feature that touches six repos isn't a feature. It's a project. And if it doesn't have a single migration plan and a single feature flag, it will fail — not in code, in coordination.

I'm writing this with the dust still on my hands from M3 — the milestone where AgentGuard got a team/organization tier. The feature shipped. The rollout was clean enough that we slept through the night. But it could very easily have gone the other way, and the difference between the two paths wasn't technical excellence in any one repo. It was the discipline of treating the whole thing as one operation, not six.

The Feature: Team/Org Tier in M3

The headline change looks small from a product slide. AgentGuard accounts can now belong to teams, teams can belong to organizations, and revenue from a marketplace sale can be split between the seller and the team via a seller_split_bps field. Underneath the slide, the data model gets four new tables: teams, team_members, pending_invites, integration_metadata. The billing pipeline has to learn about per-team plans. The TOS gets an amendment for the seller terms.
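To make the seller_split_bps arithmetic concrete, here is a minimal sketch of a basis-point split. It assumes the field holds the seller's share (10,000 bps = 100%), that amounts are integer cents, and that any rounding remainder goes to the team. None of those three choices is stated above, and the function name is hypothetical.

```python
def split_sale(amount_cents: int, seller_split_bps: int) -> tuple[int, int]:
    """Split a sale into (seller_cents, team_cents) using basis points.

    Assumption: seller_split_bps is the seller's share, 1 bps = 0.01%.
    """
    if amount_cents < 0:
        raise ValueError("amount_cents must be non-negative")
    if not 0 <= seller_split_bps <= 10_000:
        raise ValueError("seller_split_bps must be between 0 and 10000")
    # Integer floor division keeps everything in whole cents.
    seller_cents = amount_cents * seller_split_bps // 10_000
    # The team gets the remainder, so the two parts always sum to the total.
    team_cents = amount_cents - seller_cents
    return seller_cents, team_cents


print(split_sale(9_999, 7_000))  # (6999, 3000)
```

Computing the second share by subtraction rather than by a second multiplication is what guarantees no cent is lost or invented.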

None of that is structurally hard. Each piece, considered alone, is the kind of work a competent engineer ships in a week. The trap is exactly that — each piece looks self-contained, which makes it tempting to let each repo move at its own pace, and that's where the operation actually goes wrong.

The Six Repos Involved

The feature touched, in order of where the changes had to land: agentguard-lib (the Python SDK gained entitlements-aware checks for the team plan), AgentGuard platform/api (Alembic migration 040_team_org_tier plus backfill for existing accounts), AgentGuard platform/web (the invites flow and the per-team billing UI), @agentguard/sdk TypeScript (typing for the new schema), rlabs-notifier (five new templates: invite-sent, invite-accepted, billing-attached, tos-amendment-required, team-plan-active), and rlabs-consent (the TOS amendment registered as seller_tos_m3_2026_04).

The library and one of the demo repos are public: rlabs-cl/agentguard-lib lives at github.com/rlabs-cl/agentguard-lib and gitea.rlabs.cl/rlabs-cl/agentguard-lib, with the package on PyPI as rlabs-agentguard. The platform, the notifier, and the consent service are internal. The split between public and private matters for the rollout because the two sides aren't symmetric: a public artifact is visible the moment it's tagged, and that dictated part of the order.

The Mistake I Almost Made

The default plan, the one that nearly happened, was the obvious one: each repo's owner gets the spec, picks the week that works for them, lands the change behind their own feature flag, deploys when ready. The flags would have been named differently because they were owned by different teams. The rollout dates would have spread over three weeks because that's what calendars do. The communication overhead would have been six standups, six PRs to review, six deploy windows to coordinate.

The failure mode of that plan is structural, not human. You'd have ended up with six partial states in production: the lib expecting an API shape the API hadn't shipped yet, the web flagged on for users whose backend rejected the new fields, the notifier emitting templates that referenced data nobody had backfilled. There'd be no consistent rollback because each flag rolled back its own slice without coordinating with the others. Six independent on/off flags give 2^6 = 64 possible combinations of "which subset of the six is currently on". Most of them are broken, and any broken one would surface as a user-facing bug nobody could reproduce.
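The count is easy to check by enumeration. The sketch below lists every on/off combination for six independent flags. Treating only "all off" and "all on" as known-good is my simplification for illustration; in practice a few mixed states might happen to work.

```python
from itertools import product

REPOS = [
    "agentguard-lib",
    "platform/api",
    "platform/web",
    "@agentguard/sdk",
    "rlabs-notifier",
    "rlabs-consent",
]

# One boolean per repo: is that repo's own flag currently on?
all_states = list(product([False, True], repeat=len(REPOS)))

# Simplifying assumption: only "everything off" and "everything on" are coherent.
coherent = [state for state in all_states if all(state) or not any(state)]

print(len(all_states), len(coherent))  # 64 2
```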

The diagnosis is simple in retrospect: when a feature requires consistency across multiple deployable units, treating those units as independent is a category error. They aren't independent. They're one system with multiple deploy artifacts. And the operational unit has to match the architectural unit.

What Worked

The fix was to collapse the operation to one of everything. One chronological migration plan, written down once, ordered strictly across the six repos. One feature flag — m3_team_org_tier — propagated through the central config service, read identically by all six artifacts. One rollout owner — me, not a committee, with explicit veto on each step. One cross-system smoke test checklist built in Playwright, covering the end-to-end flows that crossed repo boundaries.

The single-flag piece was the load-bearing one. Once every artifact reads the same flag from the same source of truth, the state space collapses from 64 possibilities to 2: feature on, feature off. Rollback is a config change. Beta cohorts are a config change. Any artifact that needs to be deployed in a state where the flag is off is safe by construction, because the code paths gated by the flag don't activate. You can deploy the lib, the SDK, the API in any order you like as long as the flag stays off — and the flag only flips on once every artifact necessary for the user-visible behavior is in production.
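A minimal sketch of what "read identically by all six artifacts" can look like in code. The flag name is the real one; the config shape (an "enabled" switch plus a "beta_accounts" list) and both function names are hypothetical stand-ins for whatever the central config service returns.

```python
from typing import Any, Mapping

FLAG_NAME = "m3_team_org_tier"  # the one flag shared by every artifact


def team_tier_enabled(config: Mapping[str, Any], account_id: str) -> bool:
    """Return True if the team/org tier is on for this account.

    `config` is a snapshot from the central config service. Its shape
    here is an assumption made for this sketch.
    """
    entry = config.get(FLAG_NAME, {})
    if entry.get("enabled", False):
        return True
    return account_id in entry.get("beta_accounts", [])


def account_payload(config: Mapping[str, Any], account: Mapping[str, Any]) -> dict:
    """Build an API response; team fields appear only when the flag is on."""
    body = {"id": account["id"], "plan": account["plan"]}
    if team_tier_enabled(config, account["id"]):
        body["team_id"] = account.get("team_id")
    return body


account = {"id": "acct_1", "plan": "pro", "team_id": "team_9"}
flag_off = {FLAG_NAME: {"enabled": False, "beta_accounts": []}}
beta_only = {FLAG_NAME: {"enabled": False, "beta_accounts": ["acct_1"]}}

print(account_payload(flag_off, account))   # {'id': 'acct_1', 'plan': 'pro'}
print(account_payload(beta_only, account))  # {'id': 'acct_1', 'plan': 'pro', 'team_id': 'team_9'}
```

With the flag off, the response carries exactly the pre-existing keys, which is the property that makes deploy order irrelevant until the flip. Rollback and beta cohorts are both edits to the config snapshot, not to the code.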

The single-owner piece was the part I had to argue for, because it looked autocratic. It wasn't. The owner's job is to say no to deploys that violate the order, not to do the work. The work stayed distributed. The order stayed centralized. That separation is what kept the rollout coherent.

The Playwright checklist was the cheap insurance. Not exhaustive — exhaustive would have taken another sprint — but covering the cross-system flows where the seams between repos actually live. A test for "user accepts an invite, lands on the team page, is billed correctly" isn't testable inside any single repo. It only exists at the intersection. Writing those tests up front, before the rollout, forced the rollout owner to enumerate the seams explicitly. That enumeration was probably more valuable than the tests themselves.

How the Rollout Looked

The chronological order encoded the dependency graph. Step 1 was consent and notifier. Templates went live but referenced a flag that was off, so they were dormant. The TOS amendment was registered but not yet required for any flow. The point was to have the downstream-facing artifacts in place and tested before anything upstream could trigger them.

Step 2 was the library and the TypeScript SDK, both deployed behind m3_team_org_tier=off. New code paths existed in the packages shipped to consumers, but they didn't execute because the flag wasn't on. The PyPI release of rlabs-agentguard was publicly visible the moment it was tagged, which is why the gating had to be airtight: anyone could install the new version, and the new version had to behave identically to the old one until the flag flipped.

Step 3 was the API with Alembic 040 plus backfill. The migration added the four tables. The backfill populated team_members and integration_metadata for existing accounts in a way that preserved single-user semantics until the flag enabled the team semantics. The API now spoke the new schema, but with the flag off, every external response still matched the pre-M3 contract.
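One way a backfill can preserve single-user semantics is to give every existing account its own one-person team. The sketch below shows that idea against an in-memory SQLite database. It is not the actual Alembic 040 backfill: the column names, the "owner" role, and the personal-team naming are all hypothetical, and integration_metadata is left out for brevity.

```python
import sqlite3


def backfill_single_member_teams(conn: sqlite3.Connection) -> int:
    """Create a one-person team for each account that has no membership yet.

    Idempotent: accounts that already have a team_members row are skipped,
    so re-running after a partial failure is safe.
    """
    rows = conn.execute(
        "SELECT a.id FROM accounts a "
        "LEFT JOIN team_members m ON m.account_id = a.id "
        "WHERE m.account_id IS NULL"
    ).fetchall()
    for (account_id,) in rows:
        cur = conn.execute(
            "INSERT INTO teams (name) VALUES (?)", (f"personal-{account_id}",)
        )
        conn.execute(
            "INSERT INTO team_members (team_id, account_id, role) "
            "VALUES (?, ?, 'owner')",
            (cur.lastrowid, account_id),
        )
    conn.commit()
    return len(rows)


# Hypothetical, heavily simplified schema just to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE team_members (
        team_id INTEGER NOT NULL,
        account_id INTEGER NOT NULL UNIQUE,
        role TEXT NOT NULL
    );
    INSERT INTO accounts (id) VALUES (1), (2), (3);
    """
)

print(backfill_single_member_teams(conn))  # 3
print(backfill_single_member_teams(conn))  # 0 (second run is a no-op)
```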

Step 4 was the web with the flag enabled for the beta cohort. This was the only step where user-visible behavior changed, and by then every supporting artifact had been in production long enough to have absorbed any silent issues. The beta ran for nine days. Then the flag flipped for everyone.
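The four steps can be written down as a dependency graph and checked mechanically. The sketch below uses Python's standard graphlib module. The steps come from the rollout described above, but the exact edges are my encoding of them, and missing_prerequisites is a hypothetical helper for the owner's "no" on out-of-order deploys.

```python
from graphlib import TopologicalSorter

# artifact -> artifacts that must already be in production (assumed edges)
DEPS = {
    "rlabs-consent": set(),
    "rlabs-notifier": set(),
    "agentguard-lib": {"rlabs-consent", "rlabs-notifier"},
    "@agentguard/sdk": {"rlabs-consent", "rlabs-notifier"},
    "platform/api": {"agentguard-lib", "@agentguard/sdk"},
    "platform/web": {"platform/api"},
}


def rollout_steps(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group artifacts into steps; everything within a step can ship together."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        steps.append(ready)
        sorter.done(*ready)
    return steps


def missing_prerequisites(deployed: set[str], candidate: str) -> list[str]:
    """Return what still has to ship before `candidate` may be deployed."""
    return sorted(DEPS[candidate] - deployed)


for number, step in enumerate(rollout_steps(DEPS), start=1):
    print(number, step)
# 1 ['rlabs-consent', 'rlabs-notifier']
# 2 ['@agentguard/sdk', 'agentguard-lib']
# 3 ['platform/api']
# 4 ['platform/web']

print(missing_prerequisites({"rlabs-consent", "rlabs-notifier", "agentguard-lib"}, "platform/api"))
# ['@agentguard/sdk']
```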

What I Didn't Resolve Well

I want to be honest about the parts that didn't work, because they're more useful than the parts that did.

The TypeScript SDK versioning drifted from the API by about 48 hours. The SDK was published before the API canary was stable, which meant any consumer who installed the new SDK version in that window and tried to call certain endpoints got 404s on routes that hadn't been deployed yet. The cause was deployment-ordering: the SDK got tagged the moment its PR merged, on the assumption that the API was already at parity. It wasn't. The fix going forward is to gate SDK publication on a successful API canary, not on PR merge.
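A sketch of that gate, under stated assumptions: the release pipeline can obtain a report from the API canary saying whether it is healthy and which routes it serves. The CanaryReport type, the function name, and the example routes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryReport:
    """What the release job learns from the API canary (assumed shape)."""
    healthy: bool
    routes: frozenset[str]


def can_publish_sdk(required_routes: set[str], canary: CanaryReport) -> tuple[bool, list[str]]:
    """Allow publication only if the canary is healthy and serves every
    route the new SDK version calls. Also returns the missing routes."""
    missing = sorted(required_routes - canary.routes)
    return canary.healthy and not missing, missing


# Hypothetical routes, for illustration only.
sdk_needs = {"/v1/teams", "/v1/teams/{id}/invites"}
canary = CanaryReport(healthy=True, routes=frozenset({"/v1/accounts", "/v1/teams"}))

print(can_publish_sdk(sdk_needs, canary))  # (False, ['/v1/teams/{id}/invites'])
```

Returning the missing routes alongside the verdict gives the release job something actionable to print when it refuses to tag.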

Three bugs slipped past Playwright into production. Two of them were variants of the same underlying issue: the cross-system end-to-end tests didn't yet cover the "team that contains a seller" case, only the "team" case and the "seller" case separately. The intersection had been on the test backlog and hadn't been written. The third was a notifier template referencing a field that the API renamed during review, which slipped past because the templates were tested in isolation, not against the API's actual output. Both root causes point at the same gap: the tests covered each piece on its own, not the seams between them.
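The renamed-field bug is the kind a small seam test can catch: compare the placeholders a template uses with the keys the API actually emits. The sketch below assumes templates use Python str.format-style placeholders, which may not match the notifier's real template engine, and the field names are invented for the example.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Collect the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template: str, payload: dict) -> list[str]:
    """Return placeholders the template needs but the payload does not provide."""
    return sorted(template_fields(template) - set(payload))


# Hypothetical invite-sent template and API payload. The API side renamed
# team_name to team_display_name; the template was never updated.
invite_sent = "Hi {invitee_name}, {inviter_name} invited you to join {team_name}."
api_payload = {
    "invitee_name": "Ana",
    "inviter_name": "Luis",
    "team_display_name": "Platform",
}

print(missing_fields(invite_sent, api_payload))  # ['team_name']
```

The important detail is where api_payload comes from. Fed from a hand-written fixture, this test has the same blind spot as before; fed from a recorded or live API response, it sits on the seam.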

The post-mortem produced one concrete commitment: the OpenAPI contract should be generated from the library, not written by hand in the SDK. The hand-written version drifts, silently, every time the lib changes shape. A generated contract removes that silent drift: when the lib updates, the contract updates, and the SDK regenerates against it. That work is on the next milestone and it's the kind of investment that pays back across every subsequent cross-repo feature.
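To show the direction of that work, here is a toy sketch of deriving a schema fragment from a library type instead of writing it by hand. It is not the planned implementation: the TeamInvite model and its fields are hypothetical, only flat scalar fields are handled, and the output is a JSON Schema object of the kind that sits under components/schemas in an OpenAPI document, not a full contract.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TeamInvite:
    """Hypothetical library model; the real lib's types may look different."""
    team_id: str
    email: str
    seats: int
    accepted: bool


_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_for(model: type) -> dict:
    """Derive a JSON Schema object from a dataclass with scalar fields."""
    properties = {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(model)}
    return {
        "type": "object",
        "properties": properties,
        "required": sorted(properties),
    }


print(json.dumps({"TeamInvite": schema_for(TeamInvite)}, indent=2))
```

Because the schema is computed from the model, renaming a field in the library changes the generated contract in the same commit, and a diff on the generated file makes the change visible in review.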

There's a deeper pattern in those three misses that's worth naming. All three are gaps at the boundaries between repos — versioning between SDK and API, e2e coverage at the team-with-seller intersection, contract drift between hand-written SDK types and the lib they were meant to mirror. The per-repo work was clean. The interstitial work was where things slipped. That observation aligns with the post's whole thesis: in a multi-repo feature, the repos themselves aren't the risk. The seams are. And every investment in making the seams visible — generated contracts, cross-system e2es, single-flag rollouts — pays back disproportionately, because the seams are where the failures actually live.

Transferable Principle

Three things, and they compound.

A cross-repo feature needs one plan, one flag, one owner — non-negotiable. Anything less and you're shipping a coordination problem dressed as a feature. The temptation to let each repo move at its own pace is strong because each repo's owner has their own calendar and their own backlog, but that's the temptation the discipline exists to resist. The architectural unit and the operational unit have to match.

Cross-system end-to-end tests are worth their weight. If they don't exist, the rollout is blind. Per-repo tests catch per-repo bugs. The bugs that take the rollout down live in the seams between repos, and the seams have to be tested explicitly. Build the cross-system test before you ship the cross-system feature, not after the bug report.

And the one I'd want any engineering leader to hold onto: if more than three repos are involved in a single feature, ask whether the codebase should have started as a monorepo and been split later, rather than started distributed and forced to coordinate. Going the other way, collapsing distributed repos into one, is far more expensive than the original split. The decision to multi-repo has gravity. It deserves the same scrutiny as any other architectural decision, and "we already have it that way" is not a sufficient answer when the next M-something rolls around.

Question for the comments: what was your worst cross-repo rollout — and which of the three was missing: the plan, the flag, or the owner?

#Engineering #Architecture #TechLeadership #SaaS #CTO