Evidence atoms: why every agent decision needs a causal chain
Heads-up before we start: Atalaya is still WIP — I'll open-source it once it stabilizes. What follows is the working design, not a finished product. The postmortem question always arrived too late: "why did the agent close that ticket?" The answer was archaeology — scattered logs, prompts, contexts, half-remembered runs. And that has to be answerable at runtime, not at autopsy.
I've been running this loop for months and the thing that finally clicked is small. The shape of the answer isn't a narrative. It isn't a log line. It's a graph that was already there, written down at the moment of the decision, queryable in milliseconds when someone asks why.
The Problem: "Explainability" in Agents Is Mostly Theater
Most of what gets sold as "explainable AI" is post-hoc rationalization. You take the decision, hand it back to the model, ask it to justify itself. The model produces a paragraph. The paragraph reads well. Compliance signs off. Everybody goes home.
The problem is that the model produces an equally good justification for a correct decision and an incorrect one. Justification is a generative act, not a causal one. The agent didn't reach into its own reasoning trace — it generated a plausible story that ends at the same conclusion. That's not explanation. That's a postface essay.
What you actually need is a causal chain: what data it saw, what rules it applied, what tools it invoked, which outputs influenced which downstream steps. And the chain has to exist at the moment of the decision — captured as the agent acts, not reconstructed after someone asks. Reconstruction has the same failure mode as the model's narrative: it makes things up that sound right.
This isn't a philosophical point. It's the difference between "we think it did this because…" and "here is the structure of how it did it, edge by edge." One survives an audit. The other survives a Slack thread.
And post-hoc has a second failure that's quieter and worse. The model that justifies the decision today is not the model that made it. Same weights maybe, but different context window, different temperature, different upstream tools. The justification you get back is from a system that resembles the one that decided, not from the one that decided. That distance is enough for the explanation to be confidently wrong about the actual cause. You don't notice because the story is internally coherent. Coherence isn't truth.
Evidence Atom: The Minimum Unit
The primitive Atalaya operates on is the evidence atom. Each step the agent takes produces one or more. The shape is small on purpose: {id, source, content, timestamp, parents: [atom_ids], producer: 'tool|llm|rule|human'}.
The atom is immutable once written. It goes into Postgres for the metadata and the small content, MinIO for the bigger blobs. Storage is cheap and append-only. We never edit an atom after the fact — if something needs revisiting, you write a new atom that references it. That gives you the same property a ledger gives you: history is provable because nothing rewrites the past.
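A minimal sketch of that shape and the append-only rule, in plain Python. The AtomLog class is a hypothetical in-memory stand-in for the Postgres and MinIO storage, and the field types are my assumptions; only the field names come from the shape above.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
import uuid


@dataclass(frozen=True)
class Atom:
    id: str
    source: str
    content: str
    timestamp: datetime
    parents: tuple[str, ...]  # a tuple, so the parent list cannot be mutated either
    producer: str             # one of: tool, llm, rule, human


class AtomLog:
    """Hypothetical in-memory stand-in for the append-only store."""

    def __init__(self) -> None:
        self._atoms: dict[str, Atom] = {}

    def append(self, source, content, producer, parents=()):
        # Every parent must already exist: atoms only ever point at the past.
        missing = [p for p in parents if p not in self._atoms]
        if missing:
            raise KeyError(f"unknown parent atoms: {missing}")
        atom = Atom(
            id=str(uuid.uuid4()),
            source=source,
            content=content,
            timestamp=datetime.now(timezone.utc),
            parents=tuple(parents),
            producer=producer,
        )
        self._atoms[atom.id] = atom
        return atom


log = AtomLog()
original = log.append("webhook:tickets", "ticket opened", "tool")

# Revisiting means writing a new atom that references the old one.
correction = log.append(
    "ops-review", "payload was a duplicate", "human", parents=[original.id]
)

edit_rejected = False
try:
    original.content = "edited"  # in-place edits are refused by the frozen dataclass
except FrozenInstanceError:
    edit_rejected = True
```

The frozen dataclass only guards against accidental edits inside the process. In a real database the same guarantee would have to come from permissions or triggers, which this sketch does not cover.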
The field that holds the whole thing up is parents. It's an explicit reference, by id, to the prior atoms that influenced this one. Not "these atoms were in the context window." Not "these were available." The ones the producer actually consumed to produce this output. When a tool runs, the parents are the inputs it received. When the LLM emits a response, the parents are the prompts and tool results that fed that turn. When a rule fires, the parents are the atoms whose values the rule matched against.
The shape is boring. That is the point. Boring shapes survive — they fit into Postgres, they fit into joins, they fit into a graph query that's intelligible to a junior engineer and not just to me.
The producer field deserves its own note. It's a small enum but it does a lot of work. Each value implies a different trust model. Atoms from tool are deterministic — given the same inputs, the same atom. Atoms from llm are probabilistic — same inputs can yield different atoms, and that matters when you're reading the chain backward. Atoms from rule are auditable against a rule version. Atoms from human are the ground truth the rest of the system orbits around. Reading the DAG and ignoring the producer type is reading half the information.
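One way to keep the producer type in view while reading a chain is to attach the trust note to every hop. This is a sketch only: the Producer enum mirrors the four values above, while TRUST_NOTES and annotate are names I made up for illustration.

```python
from enum import Enum


class Producer(str, Enum):
    TOOL = "tool"
    LLM = "llm"
    RULE = "rule"
    HUMAN = "human"


# Hypothetical lookup table: one trust note per producer type.
TRUST_NOTES = {
    Producer.TOOL: "deterministic: same inputs, same atom",
    Producer.LLM: "probabilistic: same inputs can yield a different atom",
    Producer.RULE: "auditable against a rule version",
    Producer.HUMAN: "ground truth",
}


def annotate(chain):
    """chain: iterable of (atom_id, producer_value) pairs, in reading order."""
    lines = []
    for atom_id, producer_value in chain:
        producer = Producer(producer_value)  # raises ValueError on an unknown type
        lines.append(f"{atom_id} [{producer.value}] {TRUST_NOTES[producer]}")
    return lines


report = annotate([("a3", "rule"), ("a2", "llm"), ("a1", "human")])
```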
Causality Engine: Rebuilding the Chain When Someone Asks
When someone asks why the agent did something, you have a terminal atom — the decision itself. From there you walk the DAG backward through parents, hop by hop, until you hit primary inputs: a webhook payload, a user message, a rule definition, a tool result that came from outside the system. The walk is a graph traversal, not a search.
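The backward walk can be sketched as a breadth-first traversal over parents. This assumes the atoms are already loaded as a plain mapping from atom id to parent ids; the function name and the sample ids are hypothetical.

```python
from collections import deque


def causal_subgraph(atoms, terminal_id):
    """Walk parents backward from the decision.

    atoms: dict mapping atom id -> list of parent ids.
    Returns (depth, primary_inputs), where depth maps each reachable atom
    to its shortest hop count from the terminal atom.
    """
    depth = {terminal_id: 0}
    queue = deque([terminal_id])
    while queue:
        current = queue.popleft()
        for parent in atoms[current]:
            if parent not in depth:  # shared ancestors are visited once
                depth[parent] = depth[current] + 1
                queue.append(parent)
    # Primary inputs are the reachable atoms that have no parents of their own.
    primary_inputs = sorted(a for a in depth if not atoms[a])
    return depth, primary_inputs


atoms = {
    "webhook": [],
    "rule_def": [],
    "lookup": ["webhook"],
    "llm_summary": ["webhook", "lookup"],
    "rule_fire": ["llm_summary", "rule_def"],
    "close_ticket": ["rule_fire"],
    "unrelated": ["webhook"],  # same ticket, but never fed the decision
}

depth, primary_inputs = causal_subgraph(atoms, "close_ticket")
# primary_inputs == ["rule_def", "webhook"]; "unrelated" is never visited
```

Nothing is ranked or scored here. The walk only follows edges that were written down at decision time, which is why it is a traversal and not a search.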
The output is a Mermaid diagram of the causal subgraph. Humans read it directly: the shapes show producer type, the edges show influence. Agents can also query it programmatically because the underlying structure is a graph, not a rendered image. A query I run a lot: what evidence led to closing ticket #4521? The answer is something like 23 atoms, max depth 7, mixed sources (including one LLM call, three rule fires, four tool results, a human acknowledgment, and the originating webhook).
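Rendering a subgraph as Mermaid flowchart text takes a few lines. In this sketch the shape-per-producer mapping is my own choice for illustration, not necessarily what Atalaya draws, and atom ids are assumed to be plain alphanumeric identifiers.

```python
# Hypothetical mapping from producer type to a Mermaid node shape.
SHAPES = {
    "tool": ("[", "]"),     # rectangle
    "llm": ("(", ")"),      # rounded box
    "rule": ("{", "}"),     # rhombus
    "human": ("([", "])"),  # stadium
}


def to_mermaid(atoms):
    """atoms: dict mapping atom id -> (producer, list of parent ids)."""
    lines = ["flowchart TD"]
    # Declare every node first so the shape encodes the producer type.
    for atom_id, (producer, _parents) in atoms.items():
        opener, closer = SHAPES[producer]
        lines.append(f'    {atom_id}{opener}"{atom_id}"{closer}')
    # Then draw influence edges from each parent to the atom it fed.
    for atom_id, (_producer, parents) in atoms.items():
        for parent in parents:
            lines.append(f"    {parent} --> {atom_id}")
    return "\n".join(lines)


diagram = to_mermaid({
    "webhook": ("tool", []),
    "ack": ("human", []),
    "summary": ("llm", ["webhook"]),
    "close": ("rule", ["summary", "ack"]),
})
```

Because the diagram is generated from the same mapping an agent would query, the picture and the programmatic answer cannot drift apart.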
Query latency stays under ~200ms because the DAG is local to the ticket, not global to the system. Each conversation, each task, each unit of work carries its own subgraph. We're not walking the entire database — we're walking the slice that belongs to the thing you asked about. That locality is what makes it real-time interrogable instead of a batch job that emails you a report tomorrow.
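To make the locality concrete, here is a sketch of a scoped backward walk as a recursive query. It uses Python's built-in sqlite3 as a stand-in for Postgres, and the atom_parents table, its columns, and the scope_id naming are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE atom_parents (
    scope_id  TEXT NOT NULL,  -- the ticket, conversation, or task
    child_id  TEXT NOT NULL,
    parent_id TEXT NOT NULL
);
CREATE INDEX idx_scope_child ON atom_parents (scope_id, child_id);
""")
conn.executemany(
    "INSERT INTO atom_parents VALUES (?, ?, ?)",
    [
        ("ticket-4521", "close", "rule_fire"),
        ("ticket-4521", "rule_fire", "llm_reply"),
        ("ticket-4521", "llm_reply", "webhook"),
        ("ticket-9000", "escalate", "webhook_b"),  # another ticket's slice
    ],
)

# The recursion only follows edges that belong to the requested scope.
CHAIN_SQL = """
WITH RECURSIVE chain(id) AS (
    SELECT :terminal
    UNION
    SELECT ap.parent_id
    FROM atom_parents AS ap
    JOIN chain ON ap.child_id = chain.id
    WHERE ap.scope_id = :scope
)
SELECT id FROM chain
"""


def chain_for(scope, terminal):
    rows = conn.execute(CHAIN_SQL, {"scope": scope, "terminal": terminal})
    return [row[0] for row in rows]


ticket_chain = chain_for("ticket-4521", "close")
```

The index on (scope_id, child_id) is what keeps each hop inside one ticket's slice. Postgres accepts the same WITH RECURSIVE form, though the parameter placeholder syntax depends on the driver.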
How This Changes Tool Design
This only works if every tool the agent invokes plays along. So tools in Atalaya return result + atom descriptor. The runner — the δ-2 layer that owns the loop — persists the atom and sets parents to the inputs the agent passed in. The tool doesn't have to know about Postgres. It just declares what it produced and what it consumed.
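A sketch of that contract, with a toy runner. The names (AtomDescriptor, Runner, call_tool, word_count) are hypothetical, and the timestamp is left out to keep it short. What matters is that the tool returns its result plus a descriptor, and the runner alone assigns ids and parents.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class AtomDescriptor:
    """What a tool declares about its output. No ids, no storage details."""
    source: str
    content: str


@dataclass(frozen=True)
class Atom:
    id: str
    source: str
    content: str
    parents: tuple
    producer: str


class Runner:
    """Toy runner: owns persistence, so tools never touch the store."""

    def __init__(self):
        self.atoms = {}
        self._ids = count(1)

    def record(self, descriptor, producer, parents=()):
        atom = Atom(f"a{next(self._ids)}", descriptor.source,
                    descriptor.content, tuple(parents), producer)
        self.atoms[atom.id] = atom
        return atom

    def call_tool(self, tool, input_atoms):
        result, descriptor = tool([a.content for a in input_atoms])
        # Parents are exactly the atoms the agent passed into the tool.
        atom = self.record(descriptor, "tool", [a.id for a in input_atoms])
        return result, atom


def word_count(inputs):
    """Example tool: returns its result plus a descriptor of what it produced."""
    total = sum(len(text.split()) for text in inputs)
    return total, AtomDescriptor(source="tool:word_count", content=str(total))


runner = Runner()
message = runner.record(AtomDescriptor("user", "please close my ticket"), "human")
result, tool_atom = runner.call_tool(word_count, [message])
```

Keeping ids and parents out of the tool's hands means a tool cannot misreport its own lineage, even by accident.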
Pure tools (no side effects) only emit an atom if their output ends up influencing the chain. If the agent asked, got an answer, and then ignored it, we don't fill the graph with dead branches. Garbage collection happens at write time, not at query time. That keeps the DAGs small enough to traverse quickly.
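One way to get that behavior is to stage a pure tool's output and persist it only when a later atom names it as a parent. The sketch below is my assumption about how such a buffer could look, not a description of the actual runner; DeferredAtoms and its methods are hypothetical.

```python
class DeferredAtoms:
    """Hypothetical write-time filter: staged atoms persist only if consumed."""

    def __init__(self):
        self.persisted = {}  # atom id -> (content, parent ids)
        self.pending = {}    # staged outputs of pure tools, not yet written

    def stage(self, atom_id, content, parents=()):
        """Hold a pure tool's output without writing it."""
        self.pending[atom_id] = (content, tuple(parents))

    def commit(self, atom_id, content, parents=()):
        """Persist an atom, promoting any staged ancestors it consumed first."""
        for parent in parents:
            self._promote(parent)
        self.persisted[atom_id] = (content, tuple(parents))

    def _promote(self, atom_id):
        if atom_id in self.persisted:
            return
        content, parents = self.pending.pop(atom_id)  # KeyError if unknown
        for parent in parents:
            self._promote(parent)
        self.persisted[atom_id] = (content, parents)

    def discard_pending(self):
        """End of the unit of work: drop whatever nobody consumed."""
        dropped = sorted(self.pending)
        self.pending.clear()
        return dropped


store = DeferredAtoms()
store.commit("webhook", "ticket payload")
store.stage("lookup_customer", "tier=gold", ["webhook"])
store.stage("lookup_weather", "sunny", ["webhook"])  # asked, then ignored
store.commit("decision", "close ticket", ["webhook", "lookup_customer"])
dropped = store.discard_pending()
# dropped == ["lookup_weather"]; it never reaches the graph
```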
The overhead per tool invocation is roughly 3ms to serialize the atom and write it. Against LLM latency — which is hundreds of milliseconds at best, often seconds — that's noise. Nobody notices. The bookkeeping is invisible while the system runs, and visible the moment someone needs it.
False Friends: What Evidence Atoms Are NOT
It's worth saying what this isn't, because every time I describe it someone tries to map it onto a category they already know.
They aren't structured logs. Logs are for debugging — they capture state at points the developer chose, in formats the developer chose, with retention the developer chose. They're a tool of the engineer. Evidence atoms are an explainability contract — they capture causal influence in a fixed shape, with retention dictated by the audit requirement.
They aren't OpenTelemetry traces. Traces measure latency and the shape of distributed calls. They answer "where did the time go?" Atoms answer "what depended on what?" Both are graphs; only one is causal.
They aren't embeddings or feature stores. Those are inputs to reasoning — values that feed the model. Atoms are the recording of reasoning — they capture which input was used, at which step, to produce which output.
All four (atoms, logs, traces, and embeddings) coexist inside Atalaya. Each solves a different problem. The mistake I see most often is teams trying to make one of them do all four jobs and ending up with something that does none of them well.
The cleanest mental rule I have is: if the question is "why?", you need atoms. If the question is "how slow?", you need traces. If the question is "what state was the variable in?", you need logs. If the question is "what was the model thinking about?", you need the embedding context. Four questions, four data structures. Picking the wrong one isn't catastrophic, but it does mean your answer comes back at the wrong altitude and the person who asked walks away unsatisfied without knowing why.
What We Gained in Production
The day this stopped being a design exercise and started being load-bearing was the first SII audit query: the Chilean tax authority asking, on a specific DTE issuance, "who approved this and on what basis?" In the old shape that's a week of work and an uncomfortable phone call. With the evidence DAG it's one click: full causal chain, every atom timestamped, every producer named.
Debugging shifted shape too. When the agent acts strange, you stop asking "what happened" and start opening the DAG. The answer is structural — you see the branch that went sideways. The diagnostic mode changed from narrative to forensic.
Rollback got smarter. When a contaminated input gets identified — a bad reference table, a stale config — you can walk forward from that atom and see exactly which decisions rested on it. You revert only those. Not the whole day. Not the whole tenant. The ones that touched the bad input. That precision is what makes recovery cheap.
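The forward walk is the backward walk with the edges inverted. A minimal sketch, assuming atoms are available as a mapping from id to parent ids; the function name and sample ids are hypothetical.

```python
from collections import defaultdict, deque


def affected_by(atoms, bad_id):
    """Return every atom that rests, directly or indirectly, on bad_id.

    atoms: dict mapping atom id -> list of parent ids.
    """
    # Invert the parent edges once so we can walk forward.
    children = defaultdict(list)
    for atom_id, parents in atoms.items():
        for parent in parents:
            children[parent].append(atom_id)

    seen = set()
    queue = deque([bad_id])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


atoms = {
    "ref_table": [],  # the contaminated input
    "webhook_a": [],
    "webhook_b": [],
    "rate_a": ["ref_table", "webhook_a"],
    "decision_a": ["rate_a"],
    "decision_b": ["webhook_b"],  # never touched the bad table
}
decisions = {"decision_a", "decision_b"}

to_revert = sorted(affected_by(atoms, "ref_table") & decisions)
# to_revert == ["decision_a"]
```

Intersecting the descendants with the set of decision atoms gives the exact revert list, which is what keeps the blast radius to the decisions that touched the bad input.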
And the compliance conversation flipped. The team went from "we can't certify agentic systems" to "if they expose an evidence DAG, we can." That's not a small move. That's the door opening.
The Transferable Principle
"Why did the agent do that?" has to be answerable in production, not in postmortem. If your only answer is "let me check the logs," you don't have an auditable system. You have a system you do archaeology on, after the fact, with the bones you can find.
The cost of atoms is small: a Postgres table, a parents field, a discipline at the tool layer. The cost of not having them is the day someone asks seriously and the answer has to be "give us a week." That week is where trust goes to die.
When was the last time you had to reconstruct "why did the system do X" after the fact? How long did it take? Send me a DM or reach out via the contact channels at rlabs.cl — I suspect the average is depressing.
#MetaSoftware #AISafety #Engineering #Architecture #Compliance