A 4x3 audit to see where your Meta-Software layer leaks
If you have agents in production and you don't yet know where your Meta-Software layer is leaking, the 4x3 auditor below is a first cut you can run on yourself this week. It's a thermometer, not a closed diagnosis. Any red cell signals a localized control vacuum, and a control vacuum is exactly what shows up downstream as friction, rework, and the slowdown that nobody can quite account for in the metrics.
I'm offering this as a useful first cut, not as the single source of truth. If you run it for six months and find a cell is missing or an axis is extra, that information is worth more than the version I'm publishing today.
Why a Matrix and Not a Checklist
A checklist suggests the boxes are independent and that each one stands on its own. The Meta-Software boxes don't work that way. An organization can have excellent observability over agent behavior and almost no governance — and the cost of the missing governance is amplified, not mitigated, by knowing in fine detail what the agent did wrong.
A matrix forces you to think about each sub-category through the same evaluation axes. That makes comparison possible. It also makes the result legible to people who weren't in the room when you drew it up — a manager looking at the painted matrix can see at a glance where the worst red is and ask the right next question.
The point isn't a perfect score. It's to find the most expensive red and attack it first. Everything else in this post is in service of getting you to that cell.
The 4 Meta-Software Sub-Categories (Reminder)
For anyone coming in fresh, the four sub-categories from the original paper are these.
Functional Observability. What is the agent actually doing, expressed in business terms — not in tokens, not in API calls, not in latency percentiles. Did the agent close the right tickets? Did it touch the right files? Did the change it made advance the goal the team is actually trying to advance?
Structural Validation. Does the agent's output respect contracts, invariants, integrity rules? In code, that's type-checks, lint, tests, build success, schema validation. In finance, that's accounting identities and policy bounds. In any domain, it's the layer that says "this output is shaped correctly, even before we ask whether it's correct in substance."
Contextual Continuity. Does the agent know what happened before this turn, this session, this week? Is the state coherent across runs, or does the agent come into each interaction cold and re-learn the same things three times running?
Automated Governance. Are there codified policies the agent literally cannot violate without escalation? Not policies-as-documentation that everyone has read and nobody enforces. Policies as code, evaluated at the relevant decision point, with explicit escalation paths when the policy and the situation collide.
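To make "policies as code, evaluated at the relevant decision point" concrete, here is a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `Action` and `Policy` types, the refund cap of 500, and the `finance-on-call` escalation target are assumptions, not part of the original paper or any real policy engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    """A hypothetical agent action waiting at a decision point."""
    kind: str      # e.g. "refund"
    amount: float  # monetary value of the action


@dataclass(frozen=True)
class Policy:
    name: str
    allows: Callable[[Action], bool]  # True when the action is within policy
    escalate_to: str                  # who decides when the policy blocks


def evaluate(action: Action, policies: list[Policy]) -> tuple[str, Optional[str]]:
    """Return ("allow", None), or ("escalate", owner) for the first policy that blocks."""
    for policy in policies:
        if not policy.allows(action):
            return ("escalate", policy.escalate_to)
    return ("allow", None)


# Hypothetical policy: refunds above 500 need a human decision.
POLICIES = [
    Policy(
        name="refund-cap",
        allows=lambda a: a.kind != "refund" or a.amount <= 500,
        escalate_to="finance-on-call",
    ),
]

print(evaluate(Action("refund", 100.0), POLICIES))  # ('allow', None)
print(evaluate(Action("refund", 900.0), POLICIES))  # ('escalate', 'finance-on-call')
```

The design point is that the agent never gets a silent "no": a blocked action always comes back with a named owner, which is the explicit escalation path the definition asks for.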
The 3 Evaluation Axes
Now the axes. Each of the four sub-categories gets evaluated on the same three dimensions.
Coverage. What percentage of the agentic output does this control actually touch? Not what percentage it could touch in principle, but what percentage it touches in production today. If you have functional observability instrumented for 30% of your agents' output and silence on the rest, your coverage for that sub-category is 30%, not 100%.
Latency. Does the control run at the pace of the agent's production, or does it fall behind? If an agent ships an output in two seconds and the validation layer checks that output two hours later, the validation layer doesn't function as a control. It functions as a post-mortem.
False-positive / False-negative. Does the control flag output that is fine (false positives), and does it let real problems through (false negatives), manufacturing false confidence? This is the axis people most often skip, because measuring it requires knowing the ground truth, which is hard. But it's the axis that determines whether the green cell is actually green or just performatively green.
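The three axes can be measured from the same set of records. The sketch below assumes a hypothetical `Output` record per production output, with ground truth coming from a hand-labeled review; the field names are illustrative, and using the median for latency is my choice, not something the original paper prescribes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Output:
    """One production output as seen by a single control (hypothetical record)."""
    checked: bool           # did the control touch this output at all?
    agent_seconds: float    # time the agent took to produce it
    control_seconds: float  # time until the control's verdict (unused if unchecked)
    flagged: bool           # the control's verdict
    actually_bad: bool      # ground truth from a hand-labeled review


def coverage_pct(outputs: list[Output]) -> float:
    """Percentage of production outputs the control actually touched."""
    if not outputs:
        return 0.0
    return 100 * sum(o.checked for o in outputs) / len(outputs)


def latency_ratio(outputs: list[Output]) -> float:
    """Median control time as a multiple of agent time; needs at least one checked output."""
    return median(o.control_seconds / o.agent_seconds for o in outputs if o.checked)


def error_rates_pct(outputs: list[Output]) -> tuple[float, float]:
    """(false-positive %, false-negative %) over the checked outputs."""
    good = [o for o in outputs if o.checked and not o.actually_bad]
    bad = [o for o in outputs if o.checked and o.actually_bad]
    fp = 100 * sum(o.flagged for o in good) / len(good) if good else 0.0
    fn = 100 * sum(not o.flagged for o in bad) / len(bad) if bad else 0.0
    return fp, fn


# Illustrative data: 100 outputs, 30 checked, agent takes 2 s, control takes 2 h.
sample = (
    [Output(False, 2.0, 0.0, False, False)] * 70
    + [Output(True, 2.0, 7200.0, False, False)] * 24  # good, passed
    + [Output(True, 2.0, 7200.0, True, False)] * 1    # good, flagged: false positive
    + [Output(True, 2.0, 7200.0, True, True)] * 3     # bad, flagged
    + [Output(True, 2.0, 7200.0, False, True)] * 2    # bad, passed: false negative
)

print(coverage_pct(sample))     # 30.0
print(latency_ratio(sample))    # 3600.0
print(error_rates_pct(sample))  # (4.0, 40.0)
```

Note that the error rates are only defined over outputs the control checked, so a control with low coverage can report a flattering FP / FN number on a small slice of the surface.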
The 4x3 Matrix With Scoring Criteria
Here is the matrix, with explicit thresholds so the scoring is reproducible rather than vibes-based.
| Sub-category | Coverage | Latency | FP / FN |
|---|---|---|---|
| Functional Observability | Green >80% / Amber 40-80% / Red <40% | Green ≤agent pace / Amber 1-3× / Red >3× | Green <5% / Amber 5-15% / Red >15% |
| Structural Validation | (same thresholds) | (same thresholds) | (same thresholds) |
| Contextual Continuity | (same thresholds) | (same thresholds) | (same thresholds) |
| Automated Governance | (same thresholds) | (same thresholds) | (same thresholds) |
The thresholds are deliberately round. They're not the right thresholds for every organization — a financial-services org probably needs Green to start at 95%, not 80% — but they're a defensible starting point that gives the matrix some teeth.
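The thresholds in the table translate directly into three small functions. This is a sketch, not a reference implementation: the function names and keyword parameters are mine, and for the FP / FN cell I am assuming you pass in the worse of the two rates, which the table itself does not specify.

```python
def score_coverage(pct: float, green_above: float = 80, red_below: float = 40) -> str:
    """Coverage: Green >80% / Amber 40-80% / Red <40% (defaults from the table)."""
    if pct > green_above:
        return "green"
    return "amber" if pct >= red_below else "red"


def score_latency(control_seconds: float, agent_seconds: float) -> str:
    """Latency as a multiple of agent pace: Green <=1x / Amber 1-3x / Red >3x."""
    ratio = control_seconds / agent_seconds
    if ratio <= 1:
        return "green"
    return "amber" if ratio <= 3 else "red"


def score_error(pct: float) -> str:
    """FP / FN: Green <5% / Amber 5-15% / Red >15%. Pass the worse of the two rates."""
    if pct < 5:
        return "green"
    return "amber" if pct <= 15 else "red"


print(score_coverage(30))                  # red
print(score_coverage(90))                  # green
print(score_coverage(90, green_above=95))  # amber (stricter, finance-style threshold)
print(score_latency(7200, 2))              # red (two hours against two seconds)
print(score_error(4))                      # green
```

Keeping the cut-offs as parameters means a stricter org can raise the bar without touching the scoring logic.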
The org score is the painted matrix itself, not a single number. Any red cell is a localized control vacuum. Don't sum the matrix into a score and report "we're at 67%". The point of the matrix is that the location of the red matters more than the count of the red.
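One way to keep yourself honest about "location, not count" is to have the tooling report red cells by their coordinates and never as a total. A minimal sketch, assuming the painted matrix is stored as a nested dictionary of colors; the sample matrix is invented for illustration.

```python
SUBCATEGORIES = [
    "Functional Observability",
    "Structural Validation",
    "Contextual Continuity",
    "Automated Governance",
]
AXES = ["Coverage", "Latency", "FP / FN"]


def red_cells(matrix: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Return the (sub-category, axis) location of every red cell, in matrix order."""
    return [
        (row, axis)
        for row in SUBCATEGORIES
        for axis in AXES
        if matrix[row][axis] == "red"
    ]


# Hypothetical painted matrix.
painted = {
    "Functional Observability": {"Coverage": "red", "Latency": "amber", "FP / FN": "amber"},
    "Structural Validation": {"Coverage": "green", "Latency": "green", "FP / FN": "amber"},
    "Contextual Continuity": {"Coverage": "amber", "Latency": "amber", "FP / FN": "amber"},
    "Automated Governance": {"Coverage": "amber", "Latency": "red", "FP / FN": "amber"},
}

for subcategory, axis in red_cells(painted):
    print(f"RED: {subcategory} / {axis}")
```

The output is a list of named control vacuums, which is the thing you take into the Day 4 conversation, rather than a percentage.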
How to Apply the Auditor in a Week
If you want to actually run this, here's the rhythm I'd suggest.
Day 1-2. Inventory of agentic output. What's running, where, at what volume? You probably don't have this in one place. You'll be surprised — most orgs underestimate the spread by a factor of two or three because shadow agentic adoption is real.
Day 3. Pass each sub-category through the three axes with real data, not opinion. "We have observability" is not a coverage measurement. Pull the actual numbers. If you can't pull them, that itself is data — it means coverage is low enough that you don't even have the instrumentation to measure it.
Day 4. Paint the matrix. Sit with the team that owns the agentic systems and walk through it. The conversation is at least as valuable as the matrix — disagreements about whether a cell is amber or red usually surface assumptions the team hasn't tested.
Day 5. Identify the most expensive red. The expense is the product of coverage gap, business consequence, and speed of harm propagation. A red cell on Governance with high coverage gap and high consequence is more expensive than a red cell on Continuity that affects only a low-volume internal tool.
Week 2 onward. Closure roadmap with explicit priority. Don't try to turn the whole matrix green at once. Pick the one most expensive red and put a quarter on it. Re-run the audit at the end of that quarter to see whether the intervention worked.
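The Day 5 ranking, expense as the product of coverage gap, business consequence, and speed of harm propagation, can be sketched as follows. The 1-to-5 scales for consequence and propagation are an assumption on my part to make the product computable; the numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedCell:
    subcategory: str
    axis: str
    coverage_gap: float  # 0..1, share of output the control does not touch
    consequence: int     # 1..5, hypothetical business-impact scale
    propagation: int     # 1..5, hypothetical speed-of-harm scale

    @property
    def expense(self) -> float:
        """Product of the three factors; only meaningful for ranking, not as a cost."""
        return self.coverage_gap * self.consequence * self.propagation


def most_expensive(cells: list[RedCell]) -> RedCell:
    """Pick the single red cell to attack first."""
    return max(cells, key=lambda cell: cell.expense)


# Hypothetical red cells from a painted matrix.
reds = [
    RedCell("Automated Governance", "Coverage", coverage_gap=0.8, consequence=5, propagation=4),
    RedCell("Contextual Continuity", "Coverage", coverage_gap=0.9, consequence=1, propagation=2),
]

print(most_expensive(reds).subcategory)  # Automated Governance
```

Because the factors multiply, a cell with a large gap but trivial consequence drops down the ranking, which matches the Governance versus Continuity comparison in the Day 5 step.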
Typical Patterns That Show Up
From the small sample of organizations that have run something like this, four patterns appear often enough to be worth naming.
Pattern A (the most common). Functional Observability red, everything else amber. You know the agent ran, you know the API call returned, you know the latency was acceptable. You don't know whether it did the right thing in business terms. This is the dominant pattern in mid-sized engineering orgs that adopted coding agents in 2024-2025 without rebuilding the supervisory layer around them.
Pattern B. Structural Validation green (CI, linters, tests all in place), Governance red. The agent's code compiles and passes the tests, and it also routinely violates an internal policy the team had written down but never coded as an enforceable check. The pattern shows up most in organizations with strong engineering culture and weak policy infrastructure.
Pattern C. Contextual Continuity red. Every session starts from zero, the agent makes the same mistake three times, and senior practitioners burn cycles re-teaching the agent things it learned and lost. This is often the cheapest red to close — most context layers are tractable engineering problems — but it requires that someone notice the pattern, which depends on observability that may not exist (see Pattern A).
Pattern D (the dangerous one). Everything green on coverage, FP/FN high. The coverage column paints green, the dashboard looks healthy, and the controls are actually manufacturing false confidence. This is worse than having no control, because the org is now operating with the comfort of a green light while the underlying signal is noise. It's the pattern that produces the incidents nobody can quite reconstruct in the post-mortem.
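Of the four patterns, Pattern D is the one worth checking for mechanically, because it is the one a quick look at the dashboard will miss. A rough sketch, assuming the painted matrix is a nested dictionary of colors and reading "FP/FN high" as a red FP / FN cell; both are my assumptions.

```python
def looks_like_pattern_d(matrix: dict[str, dict[str, str]]) -> bool:
    """True when every Coverage cell is green but at least one FP / FN cell is red."""
    coverage_all_green = all(row["Coverage"] == "green" for row in matrix.values())
    any_error_red = any(row["FP / FN"] == "red" for row in matrix.values())
    return coverage_all_green and any_error_red


# Hypothetical matrix: controls are everywhere, but two of them are noisy.
painted = {
    "Functional Observability": {"Coverage": "green", "Latency": "green", "FP / FN": "red"},
    "Structural Validation": {"Coverage": "green", "Latency": "green", "FP / FN": "green"},
    "Contextual Continuity": {"Coverage": "green", "Latency": "amber", "FP / FN": "amber"},
    "Automated Governance": {"Coverage": "green", "Latency": "green", "FP / FN": "red"},
}

print(looks_like_pattern_d(painted))  # True
```

A `True` here is a prompt to go and sample ground truth for the red FP / FN cells, not a diagnosis in itself.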
What This Matrix Does NOT Do
This isn't an absolute score comparable across organizations. Two orgs with identical matrices can have radically different actual risk profiles depending on what their agents are doing — a green matrix for an internal-only documentation agent is not the same as a green matrix for an agent that approves customer refunds.
It doesn't replace SOC 2, ISO 27001, SOX, or any other framework that has formally defined controls and external auditors. Those are doing different work, in a different time horizon, for different audiences. The 4x3 auditor lives upstream of those — it's the diagnostic the team runs to see where their own house is leaking before the auditor shows up.
It doesn't distinguish between a poorly designed control and a well-designed control that hasn't reached coverage yet. A cell can be red because the wrong thing is being measured, or because the right thing is being measured in only 20% of the surface. Those are different problems with different fixes, and the matrix flags both as red without naming which.
It's not a closed derivation. The four sub-categories are a first cut from the original paper. There might be a fifth I'm missing — agentic alignment with the human Mixer's intent, for example, is a candidate I keep going back and forth on. There might be only three real categories and the fourth I'm naming is a special case of one of them. I don't know yet.
Honest Epistemic Posture
This is offered as a useful first cut, not as a single source of truth. If you apply it for six months in your org and find that a sub-category is missing, or an axis is extra, or the thresholds are wrong for your domain — that information is more valuable than the current version of the model.
I'm interested in version two of this auditor coming from the field, not from my desk. The framework will only get sharper if the people who run it report back where it broke, where it surfaced a real red the team didn't know about, and where it generated a false alarm that wasted a quarter.
That's the posture. Useful, falsifiable, open to revision. The matrix paints today; the matrix gets corrected by what you find when you act on what it paints.
If you run the auditor in your organization this week, which cell came out red first? Data beats opinions here — comment the cell and I'll come back with analysis.