What I built when 'ask the LLM to respect hexagonal' kept failing

Ask the LLM to "respect hexagonal" ten times. You'll get ten different interpretations, and at least three with framework imports in the domain. It solves the function; it doesn't hold the form.

This is the pattern that wore me down. The agent gets the endpoint working. The query is right. The parsing handles the edge cases. And then I open the diff and the domain module is importing FastAPI, because somewhere along the way the model decided that was a reasonable shortcut. I write the prompt again, sharper. The next iteration is clean. Two refactors later, the imports are back. The contract was in my head, and the model has no head.

The Concrete Pain

The thing the LLM does well is solve the functional layer. Give it an endpoint to build, a query to write, a parser to assemble, and you get something that mostly works on the first or second pass. The capability is real and it's not what's hurting. What's hurting is the form: layers, ports and adapters, the discipline that keeps the domain free of frameworks, the observability hooks that let you debug it at 3 AM. That layer degrades every iteration, and it degrades silently.

The degradation isn't dramatic. It's small drift. A use case that pulls in SQLAlchemy because "it was easier this way". A port that gets bypassed because the adapter "already had the right method". A container.py that quietly stops being the only wiring point. None of these break tests. None of these light up in code review if the reviewer is tired. They accumulate, and three months later the codebase is hexagonal in name only.

The second-order damage is what makes it expensive. Once the form has drifted, every subsequent agent run inherits the drift. The model sees the current state of the codebase, treats it as the implicit specification of what the team accepts, and produces more of the same. The drift compounds, the same way technical debt compounds, except faster — because the agent can produce twenty new modules in the time it takes a human to write one, and every one of them is calibrated against a baseline that's already wrong.

And the deeper problem underneath all of this is that no prompt survives a context refresh. You spent two weeks crafting the perfect system prompt for your team's stack — the one that finally got the model to stop putting framework imports in the domain. Then someone clears the context window. Or you start a fresh session. Or the model version updates. The prompt is gone, or it's there but the model interprets it differently, and you're back to square one. The contract was ephemeral by construction. You built a fence out of mist and you've been surprised every morning that the sheep are out.

Why Prompts Don't Scale as a Contract

A prompt has none of the properties you need for governance. It isn't versioned. You don't know what changed between the prompt the team was using in February and the one they're using in May. There's no diff. There's no commit history. There's a Slack message somewhere with "hey I tweaked the system prompt, it works better now" and that's the entire change record.

It isn't auditable. If a regulator or an internal auditor asks you "what instructions was the agent operating under when it produced this code on March 15", you have no answer. You can guess. You can show the current prompt. You cannot reconstruct the actual contract that was in force at the time. For anyone operating under SOC 2, ISO 27001, or any regulated industry, that gap is not a nuance — it's the gap.

And it isn't read the same way by two different people on the same team. The senior who wrote the prompt has the tacit context that fills in the blanks. The new joiner reads the same words and parses them differently. The model itself, on different calls, interprets the same words with different weight. There's no test that fails when the prompt degrades, because there's no spec the prompt is meant to satisfy. The prompt is the spec. When it drifts, the spec drifts, silently.

That's the gap that pushed me to build AgentGuard. Not because prompts are bad — they're useful — but because they can't be the unit of governance for code-generating agents in a serious enterprise context. Something else has to carry that load.

What Changed When I Moved the Contract Outside

The move was simple to describe and uncomfortable to make. Take the rules that were living inside the prompt — "domain has no framework imports", "ports are ABCs", "every use case returns a Result", "container.py is the only wiring point" — and put them in a versioned YAML file. The file lives in a repo. It gets reviewed via pull request. It gets tagged with SemVer. The agent reads it on every call.

A built-in archetype like api_backend (one of the ones shipped in rlabs-agentguard — alongside library, cli_tool, react_spa, web_app, script, and the debug variants) looks roughly like this:

archetype: api_backend
tech_stack:
  language: python
  framework: fastapi
  orm: sqlalchemy
  test: pytest
structure:
  - domain/
  - domain/ports/
  - adapters/
  - container.py
self_challenge_criteria:
  - no_framework_imports_in_domain
  - ports_are_abc
  - use_case_returns_result
  - container_is_single_wiring_point
scoring_weights:
  type_safety: 0.9
  modularity: 0.8
  observability: 0.7

The LLM no longer has to "remember" hexagonal. The archetype carries it. Every call gets the same structured input. When the team decides the rule needs to change — say, the use case should return a Result[T, E] instead of just Result — the change goes through PR review, lands as v1.4.3, and from that tag forward every agent run picks up the new contract.

The thing that changes psychologically is that the conversation with the model becomes thinner and the conversation with the archetype becomes thicker. You stop arguing with the model about whether SQLAlchemy belongs in the domain. The archetype says it doesn't. The model either complies or the validation catches it. The argument moved out of the chat window and into a file you can version, diff, and review.

And the team dynamics shift in a way I didn't expect. The architectural conversations that used to happen as commentary inside pull requests — "hey, why is this importing from the framework layer" — start happening earlier, against the archetype itself. The right level of debate moves from per-PR to per-archetype-version. You're no longer relitigating the same architectural principle on every change. You're litigating it once, lifting it into the contract, and then the contract enforces it across every subsequent run. The number of code-review cycles drops, and the ones that remain are about logic instead of about shape.

The Short Pipeline: Skeleton -> Contracts -> Wiring -> Logic

The other thing that fell out of building this was that the work naturally decomposes into stages. You don't need the agent to produce the whole project in one shot. You need it to produce the skeleton first — the directory structure, the empty modules, the pyproject — against the archetype's structure declaration. Stop. Look. If the skeleton is wrong, every subsequent stage will inherit the wrongness, and catching it here costs nothing.

Then the contracts and wiring: the port ABCs in domain.ports, the adapter signatures, the container.py composition. Again, stop. Look. The archetype's self_challenge_criteria runs over this stage and surfaces any rubric violation. If a port isn't an ABC, you see it now, before any logic has been written against it.

Then the logic — the use cases, the actual business operations. This is the stage where the LLM is at its strongest, because the structural decisions have already been pinned down by the previous stages. The model is no longer juggling "what's the right architecture for this" and "what's the right algorithm for this" simultaneously. It's just doing the second one.

Then the validate-and-digest pass: run the full rubric, score the output, emit a digest that tells you which criteria passed, which failed, and at what weighted score. Token and cost tracing per stage are emitted as part of the pipeline output, so you know what each step cost and where the spend went. You can stop between any two stages and review. You can rerun a single stage without re-running the others. The pipeline is interruptible by design, because the decisions worth reviewing are also the decisions worth not redoing.

The cost-tracing piece deserves its own mention, because it's the one most teams underestimate when they evaluate this kind of tooling. When you run a multi-stage pipeline, you find out quickly that the stages are not equally expensive. Skeleton generation is cheap. Logic generation can be ten times the cost. Validation can swing wildly depending on how big the generated code is. Having the per-stage token and cost numbers in the pipeline output means you can make informed decisions about where to invest the smaller model versus the bigger one — skeletons might be fine on a cheap model, logic might warrant the better one, validation might benefit from a different model entirely for cross-check. Without the per-stage tracing, those decisions are guesses.

Demo #07 in the public agentguard-demo repo (gitea.rlabs.cl/rlabs-cl/agentguard-demo) walks through this end-to-end on a FastAPI note-api built against the api_backend archetype. Demo #09 in the same repo shows the harder case — a spaghetti codebase being migrated, where the archetype value shows up most clearly because there's a baseline to compare against.

Transferable Principle (For Your Stack, Not For AgentGuard)

I want to be careful here, because the post could easily slide into "and that's why you should use AgentGuard". That's not the point I'm making. The point is structural, and it applies whether or not you ever install rlabs-agentguard from PyPI.

The minimum viable unit of governance in enterprise AI isn't a prompt — it's a contract you can version. Whatever form that contract takes — YAML archetype, JSON Schema, ADR, OpenAPI spec, even a markdown file in a specific shape — the load-bearing property is the same: it lives outside the model, it has a history, it can be reviewed, and the audit trail asks a single question that has a single answer ("which version was in force on date X"). The prompt fails every one of those tests. The contract passes them.

Your stack already has tacit rules. Every codebase does. The senior engineers know that "we don't do circular dependencies between these two modules", that "this layer never imports from that one", that "every external call gets a timeout and a circuit breaker". Those rules live in three people's heads. Write them as a rubric. Not as a linter — a linter is binary and brittle — but as a rubric the agent can be evaluated against, in language the agent can parse. The exercise alone, before you give it to any agent, will earn its weight: the act of articulating the rule forces clarity, and the clarity surfaces the cases where the rule actually has exceptions you'd been quietly enforcing in code review without naming.

And then the harder, less comfortable principle: if you can't audit the rule, you can't audit the agent that follows it. The audit question — "what was the agent told to do, and how do we know" — has to have a clean answer or the agent isn't safe to run on anything that matters. Prompts can't give you that answer. Versioned external contracts can. Whatever tool you choose, that's the shape of the thing you're choosing.

Question for comments: which architectural rule on your team lives only in the seniors' heads today?

#AI #GenAI #Engineering #SoftwareCraftsmanship #CodeQuality