From prompt to primitive: why I stopped iterating on prompts

A prompt doesn't get versioned, doesn't get audited, and gets read differently by two people on the same team. If that's your unit of governance, you're not governing.

I've been writing prompts for code-generating agents for about three years now, and the realization that pushed me to stop treating them as the primary contract was small and embarrassing: I couldn't tell you, with confidence, what prompt my team was using in production last Tuesday. I could show you the prompt that was in the file now. I couldn't reconstruct the one that was actually running when the deploy went out. And once I noticed that gap, I couldn't unsee it.

What a Prompt Can't Do

The properties a prompt lacks are the same properties any serious operational artifact needs, and the lack isn't a matter of discipline — it's structural.

No history. You don't know what changed between v3 and v4 of the system prompt. There's no diff. There's no commit message. There's a person who remembers tweaking it and there's a Slack thread somewhere with "hey I made it a bit more strict". When the agent's output quality drops next week, you have no clean way to know whether the prompt changed, the model changed, or the input distribution changed. The variable that was supposed to be controlled isn't tracked.

No CI failures. Prompts fail in production, silently. There's no test that runs at PR time and says "this prompt change broke the guardrail against framework imports in domain". The only test is: ship it, watch the codebase drift over the next three weeks, and notice that something is off in a retrospective months later. The feedback loop is so long that by the time you have the signal, you can't trace it to the cause.

Not reviewable by standard code review. Prompts live in .py files as triple-quoted strings, or in .txt files, or in environment variables, or in tool configurations that aren't checked in. A pull request that changes a prompt looks, to the reviewer, like a string change. The reviewer reads it as text. They don't have the structured eye they'd bring to a function signature change. The semantic weight of the change is invisible in the diff.

If those three properties (version history, CI enforceability, structured review) are the minimum you'd demand of anything load-bearing in your codebase, then a prompt can't be load-bearing. The conclusion isn't "throw out prompts". Prompts are useful for the conversational layer, the one-off, the exploration. The conclusion is: prompts can't be the unit of governance for the parts of the system that have to be auditable.

What an Archetype Is, Concretely

An archetype is what the prompt was trying to be, in a form that has the properties the prompt lacks. In rlabs-agentguard, an archetype is a YAML file with a fixed shape:

archetype: api_backend
version: 1.4.2
tech_stack:
  language: python
  framework: fastapi
  orm: sqlalchemy
structure:
  - domain/
  - domain/ports/
  - adapters/
  - container.py
pipeline:
  levels:
    - skeleton
    - contracts_and_wiring
    - logic
    - validate
validation:
  checks:
    - no_framework_imports_in_domain
    - ports_are_abc
    - use_case_returns_result
self_challenge:
  criteria:
    - container_is_single_wiring_point
    - every_adapter_has_test
scoring_weights:
  type_safety: 0.9
  modularity: 0.8
  observability: 0.7

The library ships a set of built-in archetypes for the common shapes: api_backend, library, cli_tool, react_spa, web_app, script, debug_backend, debug_frontend, plus a few non-code archetypes (business_plan, legal_contract, scientific_paper) for the meta-software use cases that aren't purely engineering. You start from one of these. You fork it. You adjust the tech_stack to your stack, the structure to your conventions, the checks to your team's tacit rules. The fork lives in your repo, alongside the code it governs.
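
To make "fixed shape" concrete, here is a minimal sketch of a shape check a team could run on its fork before anything else. It is not rlabs-agentguard's own loader: the function name shape_errors, the required-key list (copied from the example above), and the assumption that scoring weights fall between 0 and 1 are all mine. The fork is written as the dict a YAML parser would hand back, so the sketch needs nothing beyond the standard library.

```python
import re

# Assumption: the top-level keys of the example archetype are all required.
REQUIRED_KEYS = {
    "archetype", "version", "tech_stack", "structure",
    "pipeline", "validation", "self_challenge", "scoring_weights",
}
# Simplified MAJOR.MINOR.PATCH pattern (no pre-release or build metadata).
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def shape_errors(archetype: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is fine."""
    errors = []
    missing = sorted(REQUIRED_KEYS - set(archetype))
    errors.extend(f"missing key: {key}" for key in missing)
    version = str(archetype.get("version", ""))
    if not SEMVER.match(version):
        errors.append(f"version is not MAJOR.MINOR.PATCH: {version!r}")
    # Assumption: weights are meant to lie in [0, 1].
    for name, weight in archetype.get("scoring_weights", {}).items():
        if not 0.0 <= weight <= 1.0:
            errors.append(f"scoring weight out of range: {name}={weight}")
    return errors


# A fork of api_backend, as the dict a YAML parser would produce.
fork = {
    "archetype": "api_backend",
    "version": "1.4.2",
    "tech_stack": {"language": "python", "framework": "fastapi", "orm": "sqlalchemy"},
    "structure": ["domain/", "domain/ports/", "adapters/", "container.py"],
    "pipeline": {"levels": ["skeleton", "contracts_and_wiring", "logic", "validate"]},
    "validation": {"checks": ["no_framework_imports_in_domain", "ports_are_abc"]},
    "self_challenge": {"criteria": ["container_is_single_wiring_point"]},
    "scoring_weights": {"type_safety": 0.9, "modularity": 0.8, "observability": 0.7},
}

print(shape_errors(fork))  # []
```

Run at PR time, a check like this turns a malformed fork into a failed build instead of a surprise at generation time.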

The properties this gives you are the inverse of the prompt's failures. The archetype has history — git tracks it. It fails in CI — the validation pass runs against generated code and emits structured failures. It's reviewable in PR — a change to validation.checks shows up in the diff with the same structural weight as a function signature change. And it gets tagged with SemVer, which means "what was the agent operating under on date X" has a clean answer: the tag that was current at the time.
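
What "emits structured failures" can mean in practice is easiest to show with one check. The sketch below is my own illustration of no_framework_imports_in_domain, not the library's implementation: the Failure record, the FORBIDDEN_ROOTS set, and the convention of passing generated files as a path-to-source dict are assumptions. It uses Python's ast module to find imports rather than matching on text.

```python
import ast
from dataclasses import dataclass

# Assumption: these are the frameworks the domain layer must not import.
FORBIDDEN_ROOTS = {"fastapi", "sqlalchemy"}


@dataclass(frozen=True)
class Failure:
    check: str
    path: str
    line: int
    detail: str


def no_framework_imports_in_domain(files: dict[str, str]) -> list[Failure]:
    """Scan {path: source} and flag framework imports in files under domain/."""
    failures = []
    for path, source in sorted(files.items()):
        if not path.startswith("domain/"):
            continue
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                # Relative imports have no module name; treat them as empty.
                modules = [node.module or ""]
            else:
                continue
            for module in modules:
                if module.split(".")[0] in FORBIDDEN_ROOTS:
                    failures.append(Failure(
                        "no_framework_imports_in_domain", path, node.lineno,
                        f"imports {module}",
                    ))
    return failures


generated = {
    "domain/orders.py": "from dataclasses import dataclass\nimport sqlalchemy.orm\n",
    "adapters/http.py": "from fastapi import APIRouter\n",
}
for failure in no_framework_imports_in_domain(generated):
    print(failure)
```

Only domain/orders.py is flagged; the FastAPI import in adapters/ is allowed because the rule is scoped to the domain layer. Each failure carries the check name, file, and line, which is what lets CI report it like any other test failure.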

The LLM receives the archetype as structured input on every call. There's no "hope the prompt survives". The contract is re-injected, identical, every time.
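
One way to make "identical, every time" checkable rather than hoped for is to serialize the archetype canonically and attach a hash to each call. The payload shape and the build_request name below are hypothetical; this is not rlabs-agentguard's wire format, only a sketch of the idea.

```python
import hashlib
import json


def build_request(archetype: dict, task: str) -> dict:
    """Assemble a per-call payload; the archetype is serialized canonically each time."""
    # sort_keys plus fixed separators gives byte-identical output for equal dicts.
    contract = json.dumps(archetype, sort_keys=True, separators=(",", ":"))
    return {
        "contract": contract,
        "contract_sha256": hashlib.sha256(contract.encode("utf-8")).hexdigest(),
        "task": task,
    }


archetype = {
    "archetype": "api_backend",
    "version": "1.4.2",
    "validation": {"checks": ["ports_are_abc"]},
}

first = build_request(archetype, "add an orders endpoint")
later = build_request(archetype, "refactor the payment adapter")
print(first["contract_sha256"] == later["contract_sha256"])  # True
```

Logging the hash next to each call means a drifted contract shows up as a changed value in the logs, not as a vague sense that the agent started behaving differently.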

The Mental Shift: Contract, Not Instruction

The move that took me longest to internalize wasn't the technical one. It was the semantic one. A prompt says "do X". An archetype declares "the result must meet X, Y, Z". Those look superficially similar. They aren't.

A prompt is an instruction. It puts the burden of remembering on the model. The model has to interpret it, hold it in context, and produce output consistent with it — turn after turn, in a context window that's filling up with other things, with no mechanism to re-anchor when the interpretation drifts.

An archetype is a contract. It declares the shape the output has to satisfy and provides the mechanism to check whether it does. The model is no longer responsible for remembering the rules. The contract carries them. The validation pass checks them. The model's job shrinks to "produce output that passes", which is a much narrower and more tractable task than "behave consistently with a long prose document across an indefinite session".
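
The "produce output that passes" loop can be sketched in a few lines. Everything here is illustrative: the model is a plain callable standing in for an LLM call, the single check is a toy string match, and feeding the previous failures back into the next attempt is my assumption about how a retry could work, not a description of the library's pipeline.

```python
from typing import Callable

Check = Callable[[str], list[str]]        # returns failure messages, empty if passing
Model = Callable[[str, list[str]], str]   # (task, previous failures) -> generated code


def generate_until_valid(model: Model, task: str, checks: list[Check],
                         max_attempts: int = 3) -> tuple[str, int]:
    """Call the model until every check passes; return (output, attempts used)."""
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        output = model(task, failures)
        failures = [msg for check in checks for msg in check(output)]
        if not failures:
            return output, attempt
    raise RuntimeError(f"contract not satisfied after {max_attempts} attempts: {failures}")


def use_case_returns_result(code: str) -> list[str]:
    # Toy stand-in for a real check: looks for a Result return annotation.
    return [] if "-> Result" in code else ["use case does not return Result"]


def stub_model(task: str, failures: list[str]) -> str:
    # Stand-in for an LLM call: corrects its output once told what failed.
    if failures:
        return "def place_order(cmd) -> Result: ..."
    return "def place_order(cmd): ..."


code, attempts = generate_until_valid(stub_model, "place order", [use_case_returns_result])
print(attempts)  # 2
```

The rules live in the checks, not in the model's memory. The model only ever sees the task and the list of what failed last time.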

The auditability change is the one that matters most to enterprise consumers. With a prompt, the audit question — "what version of the instructions was the agent operating under on March 15" — has no good answer. With a versioned archetype, the question becomes trivial: check the tag that was current on March 15, pull that version of the YAML, that's exactly what the agent saw. The audit trail isn't a forensic exercise. It's a lookup.
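
The lookup really is that small. A minimal sketch, assuming the release history has been exported as date and tag pairs; the dates, the year, and the tag names below are invented for illustration, and in a real repo they would come from the tag history.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical release history: (date the tag became current, tag name), sorted by date.
RELEASES = [
    (date(2025, 1, 10), "api_backend-v1.3.0"),
    (date(2025, 3, 2), "api_backend-v1.4.1"),
    (date(2025, 4, 20), "api_backend-v1.4.2"),
]


def tag_in_force(on: date) -> Optional[str]:
    """Return the latest tag released on or before `on`, or None if none existed yet."""
    index = bisect_right([released for released, _ in RELEASES], on)
    return RELEASES[index - 1][1] if index else None


print(tag_in_force(date(2025, 3, 15)))  # api_backend-v1.4.1
```

From there, checking out that tag gives the exact YAML the agent received.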

What the Switch Cost

I want to be honest about the friction, because the switch is real and the cost is real.

Letting go of chasing "the perfect prompt". There's a particular satisfaction to refining a prompt — a sentence here, a clarifying clause there, a structural reorganization that finally gets the model to do the thing. That work doesn't disappear when you move to archetypes, but its locus changes. You're no longer crafting prose. You're declaring criteria. The aesthetic is different, and the people on the team who got good at prompt-craft may feel like they're being asked to do something duller. They are, in a sense. They're being asked to do something more durable instead.

Writing down the team's tacit rules. This is the uncomfortable one. The team has rules. Some are in style guides. Most live in three people's heads and get applied in code review without ever being named. Writing them down as validation.checks means surfacing disagreements. "We always use Result types" turns out to mean "the backend team does, the data team doesn't". "Domain has no framework imports" turns out to have three exceptions nobody agreed on. The archetype forces the conversation, and the conversation is sometimes hard.

Accepting that no archetype starts complete. The first version of any archetype is wrong in interesting ways. It misses rules it should have. It includes rules that turn out to be wrong. It scores things in proportions that don't match what the team actually values. The archetype iterates with use, the same way a test suite iterates with bugs caught. Treating v1.0 as the permanent contract is a mistake. Treating it as the starting commit of a long history is the right frame.

Tooling discomfort. I'll add one more, because it's the one nobody warns you about: the existing tooling around prompts — the playgrounds, the prompt-management UIs, the A/B testing frameworks that vendors have been building for the last two years — doesn't fit. Some of it transfers (evaluation harnesses, in particular). Most of it was built around the assumption that the prompt is the artifact you iterate on, and once the artifact becomes a versioned YAML in your repo, the workflow lives in your normal engineering tools — git, PR review, CI — rather than in a separate prompt-management product. Teams that invested heavily in the prompt-tooling ecosystem can find this irritating in the short term, even when the long-term shape is clearly cleaner.

Transferable Principle

If your team already iterates on prompts, as most serious teams using code-generating agents do, the next step isn't "better prompts". The next step is moving the contract outside the model.

The form matters less than the versioning. YAML is what rlabs-agentguard happens to use because it's reviewable and diff-friendly. JSON Schema works. An ADR works. OpenAPI works for the subset of contracts that are API-shaped. The load-bearing property isn't the syntax — it's that the contract lives in a versioned, reviewable, taggable artifact that the model receives as structured input rather than as embedded instruction.

Demo #09 in the public agentguard-demo repo (gitea.rlabs.cl/rlabs-cl/agentguard-demo) walks through the case where this matters most — a spaghetti codebase being migrated to a modern shape, where the archetype carries the destination form across hundreds of agent calls without drifting, because it doesn't depend on the model remembering anything between sessions.

And the second-order consequence: once you have the contract, the model becomes a commodity. You can swap the underlying LLM (switch providers, change versions, A/B between two) and the contract remains stable. The output quality may shift, the cost profile may shift, but the specification of what the output has to satisfy doesn't move. That's the property that turns model selection into a procurement decision rather than a re-architecture, and it's the property that prompts, locked to a specific model's interpretive habits, can never quite provide.

This is the move that matters most for any organization thinking past the current vendor cycle. If your agent infrastructure is built on a prompt that was carefully tuned to Anthropic's interpretive habits or to OpenAI's, switching vendors is a project. Re-tuning. Re-testing. Re-deploying. If the contract lives outside the model — in an archetype, in a schema, in a versioned YAML — switching becomes mostly a matter of pointing the runtime at a different endpoint and re-running the validation suite against the new model's output. The contract didn't change. Only the executor did. That's a different shape of dependency, and it's the shape that survives the next three model generations.
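
"Re-running the validation suite against the new model's output" might look like the sketch below. The two executors are stubs standing in for different vendors' models, the checks are toy string matches, and pass_rates is a name I made up. The point is only that the checks dict stays the same while the executor varies.

```python
from typing import Callable

Executor = Callable[[str], str]   # task -> generated code
Check = Callable[[str], bool]     # generated code -> passes?


def pass_rates(executors: dict[str, Executor], tasks: list[str],
               checks: dict[str, Check]) -> dict[str, float]:
    """Run every executor over the same tasks and checks; return the fraction that pass."""
    rates = {}
    for name, run in executors.items():
        results = [check(run(task)) for task in tasks for check in checks.values()]
        rates[name] = sum(results) / len(results)
    return rates


# The contract: identical for every executor.
checks = {
    "ports_are_abc": lambda code: "(ABC)" in code,
    "use_case_returns_result": lambda code: "-> Result" in code,
}

# Stand-ins for two vendors' models.
executors = {
    "model_a": lambda task: "class OrderPort(ABC): ...\ndef run(cmd) -> Result: ...",
    "model_b": lambda task: "class OrderPort: ...\ndef run(cmd) -> Result: ...",
}

print(pass_rates(executors, ["place order"], checks))
# {'model_a': 1.0, 'model_b': 0.5}
```

A vendor switch then produces a number to compare, measured against a specification that did not move.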

The practical upshot, for a CIO or VPE planning a two-to-three-year roadmap: investments in versioned, externalized contracts are durable. Investments in prompt-craft against a specific model version are not. Both have a place. Only the first belongs in the architecture diagram.

Question for the comments: how many versions of "the main system prompt" live in your repo today — and which one is running in prod?

#AI #GenAI #TechLeadership #Engineering #CTO