Two Pillars Protocol — A proposed maturity model for AI-era software engineering
Software engineering has had maturity models for forty years. CMMI told organizations whether their processes were repeatable. ISO/IEC 33000 told them whether quality was measurable. Pöppelbuß & Röglinger (2011) gave the field a methodology to evaluate the maturity models themselves.
None of them speak the language of the engineer who, last Tuesday, dispatched four agentic flows in parallel, gated three of them with tests, killed one because the spec was wrong, and shipped the surviving merge inside two hours.
That engineer is doing something old (engineering) and something new (engineering with AI as a collaborator). The new part isn't "uses AI tools" — that's like measuring a programmer in 1995 by whether they used an IDE. The new part is how the work is structured when an engineer can orchestrate multiple intelligent subsystems concurrently, validate their output, and integrate it into a coherent whole.
The Two Pillars Protocol is an instrument and a maturity model for measuring exactly that: the new individual cognitive capacity, and the organizational conditions that let it consolidate. This post is the long-form explanation of what it measures, how it measures it, and what its current limits are.
The two pillars
Pillar 1 — Mixer Mode (individual). The cognitive disposition to run multiple agentic threads in parallel, evaluate their outputs simultaneously, and integrate them into coherent work. Like a music producer running stems through a console: not authoring each track from scratch, but coordinating, gating, mixing. The protocol measures three sub-dimensions:
- Multiplicity — how many distinct cognitive axes (product reasoning, architecture, code, validation) a practitioner operates on within the same working window.
- Simultaneity — how many in-flight delegations they carry at once without losing coherence.
- Integration — how tight the feedback loop is between dispatch, evaluation and re-direction.
These compose into a 0–100 Mixer Index, scored server-side from 18 items with explicit weights and inverse penalties.
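As a minimal sketch of how weighted Likert items could compose into a 0–100 index, the following assumes that "inverse penalties" means reverse-coding the inverse items before weighting. The item IDs, weights, and normalization are hypothetical and chosen only for illustration; they are not the protocol's scoring code.

```python
from dataclasses import dataclass

LIKERT_MIN, LIKERT_MAX = 1, 5  # the post says items are mostly Likert 1-5


@dataclass(frozen=True)
class Item:
    item_id: str           # hypothetical identifier
    weight: float          # hypothetical weight
    inverse: bool = False  # True for reverse-coded items


def mixer_index(items, responses):
    """Weighted mean of normalized item responses, scaled to 0-100."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = 0.0
    for item in items:
        raw = responses[item.item_id]
        if not LIKERT_MIN <= raw <= LIKERT_MAX:
            raise ValueError(f"{item.item_id}: response {raw} out of range")
        # Reverse-code inverse items so that a higher value always means "more".
        value = (LIKERT_MIN + LIKERT_MAX - raw) if item.inverse else raw
        normalized = (value - LIKERT_MIN) / (LIKERT_MAX - LIKERT_MIN)
        weighted += item.weight * normalized
    return round(100 * weighted / total_weight, 1)


# Three illustrative items standing in for the 18 real ones.
ITEMS = [
    Item("mult_01", 1.0),
    Item("simul_01", 1.5),
    Item("integ_01", 1.0, inverse=True),
]
```

Because the function is pure (no clock, no randomness), the same responses and weights always give the same score, which is the property server-side scoring relies on.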
Pillar 2 — Meta-Software (organizational). Software (and the organization around it) that understands, modifies, validates and orchestrates other software at the symbolic level. The capacity AI-era practitioners need to reason about systems whose internals they no longer hand-code line by line. The protocol measures four D-scores:
- D1 Cognitive — foundational skills under AI assistance (working memory under load, hypothesis testing, error attribution between human and model).
- D2 Meta-Software — the org's investment in tooling that operates on its own software.
- D3 Institutional — whether the surrounding system (processes, peers, leadership) creates conditions where Mixer Mode can develop.
- D4 Exposure — volume and variety of AI-mediated work experience. A floor capacity: without enough hours running agentic flows, the other dimensions don't consolidate.
Each D-score is 0–5. The aggregate org capability is the mean of the four D-scores.
Together, the two pillars define a four-quadrant map:
- Leading edge — high Mixer, high Meta-Software.
- Orchestrator without theory — high Mixer, low Meta-Software (individuals running fast on top of an org that won't support them).
- Theory without practice — low Mixer, high Meta-Software (the org has tooling, the people don't know how to leverage it).
- Early in the curve — low/low.
The position is the diagnosis. The dimensions are the prescription.
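The quadrant placement can be sketched as a simple two-threshold classification. The post does not state the cut points or which score sits on the Meta-Software axis, so the midpoint cutoffs (50 on the Mixer Index, 2.5 on the 0–5 scale) and the use of the mean D-score for the second axis are assumptions made for illustration.

```python
from statistics import mean

MIXER_CUTOFF = 50.0  # assumption: midpoint of the 0-100 Mixer Index
ORG_CUTOFF = 2.5     # assumption: midpoint of the 0-5 D-score scale


def org_capability(d_scores):
    """Mean of the D-scores, each expected to lie in the 0-5 range."""
    for name, score in d_scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{name}: score {score} outside 0-5")
    return mean(list(d_scores.values()))


def quadrant(mixer_index, d_scores):
    """Map a (Mixer Index, D-scores) pair to one of the four quadrant labels."""
    high_mixer = mixer_index >= MIXER_CUTOFF
    high_org = org_capability(d_scores) >= ORG_CUTOFF
    if high_mixer and high_org:
        return "Leading edge"
    if high_mixer:
        return "Orchestrator without theory"
    if high_org:
        return "Theory without practice"
    return "Early in the curve"
```

For example, a respondent with a Mixer Index of 72 and D-scores of 2, 1, 2 and 3 (mean 2.0) would land in "Orchestrator without theory" under these assumed cutoffs.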
How the instrument works
- 78 items across 6 blocks (67 scored + 11 metadata/consent). 12–18 minutes for a single respondent. Items are mostly Likert (1–5), with multi-select for tool inventories and short numeric inputs for years of experience and hours per week.
- Server-side scoring. Client-computed scores are never trusted. The same responses and the same weights always produce the same output.
- K-anonymity ≥ 3 per aggregate cell. No org or sub-team is reported with fewer than 3 respondents. This applies to dashboards, PDFs, and any peer comparison.
- Versioned consent with three separately revocable agreements: research use (required), follow-up interview (optional), findings notification (optional). Each is stored independently and can be withdrawn at any time.
- Token-based individual reports — no account, no email login. The respondent receives a private URL at the end of the diagnostic. The token is the access credential.
- Cycle mode for organizations — internal cohort, k-anon protected, aggregate dashboard for leadership, individual reports for respondents, PDF snapshot at cycle close.
- Pilot license open by design. The operationalization, the scoring code, and the schema are available on request and will be public when v0.5 lands.
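To make the consent and token items above concrete, here is a minimal sketch of one way to store three independently revocable agreements and to issue an unguessable report token. The field names, the version string, and the 32-byte token length are hypothetical; the platform's actual schema is not described in this post.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

CONSENT_VERSION = "example-v1"  # hypothetical version label
AGREEMENTS = ("research_use", "follow_up_interview", "findings_notification")
REQUIRED = ("research_use",)    # the post marks research use as required


@dataclass
class ConsentRecord:
    agreement: str
    version: str
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    @property
    def active(self):
        return self.revoked_at is None


def new_report_token():
    """Random URL-safe token that acts as the only access credential."""
    return secrets.token_urlsafe(32)  # 32 random bytes; length is an assumption


def grant_consents(choices):
    """Create one independent record per agreement the respondent accepted."""
    missing = [a for a in REQUIRED if not choices.get(a)]
    if missing:
        raise ValueError(f"required consent not given: {missing}")
    now = datetime.now(timezone.utc)
    return {
        a: ConsentRecord(a, CONSENT_VERSION, now)
        for a in AGREEMENTS
        if choices.get(a)
    }


def revoke(records, agreement):
    """Withdraw a single agreement without touching the others."""
    records[agreement].revoked_at = datetime.now(timezone.utc)
```

Keeping one record per agreement, each with its own version and timestamps, is what lets a respondent withdraw the optional interview consent, for instance, while the other agreements stay in force.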
Individual vs. organizational measurement
The most important conceptual distinction in the protocol is the one between individual perception and organizational measurement.
When a single practitioner takes the diagnostic, they answer items about themselves (Mixer Mode dimensions) and items about their org (D1–D4 dimensions). The Mixer Index they receive is a measurement — it summarizes their own behavior. The D-scores they receive are their perception — N=1 perceptual data about the surrounding org.
This matters because perceptual data from a single observer has known limits. A practitioner who is frustrated with their CTO may rate D3 (Institutional) low for reasons that have more to do with last week's standup than with structural conditions. A practitioner who is enthusiastic about new tooling may rate D2 high before the tooling has produced any results.
The protocol treats individual D-scores as what one respondent thinks, not what the org is.
The triangulated measurement comes from running a cycle: multiple respondents from the same organization (across hierarchy levels), with reports aggregated under k-anonymity. The cycle aggregate D-scores are an organizational measurement in the same sense that engagement surveys or NPS scores are — multi-observer, anonymized, statistically meaningful at the org level.
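A minimal sketch of cycle aggregation under the k ≥ 3 rule might look like the following. The cell labels and the data layout are hypothetical, and the sketch only suppresses small cells; it does not guard against recovering a suppressed cell by subtracting reported cells from an org-wide total.

```python
from collections import defaultdict
from statistics import mean

K_MIN = 3  # from the post: no aggregate cell with fewer than 3 respondents


def aggregate_d_scores(rows, k=K_MIN):
    """Average D-scores per cell, suppressing cells with fewer than k respondents.

    rows: iterable of (cell, d_scores) pairs, where cell is a hypothetical
    label such as (org, sub_team) and d_scores maps "D1".."D4" to 0-5 values.
    """
    by_cell = defaultdict(list)
    for cell, d_scores in rows:
        by_cell[cell].append(d_scores)

    report = {}
    for cell, responses in by_cell.items():
        if len(responses) < k:
            report[cell] = None  # suppressed: too few respondents to report
            continue
        dims = sorted(responses[0])
        report[cell] = {
            d: round(mean(r[d] for r in responses), 2) for d in dims
        }
    return report
```

With three respondents in one sub-team and two in another, the first cell gets per-dimension means and the second is returned as `None`, so a dashboard or PDF built on this output has nothing to show for the small cell.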
The instrument is explicit about this distinction at every touchpoint: the in-tool quick preview before final submit, the dimension card in the web report, and the glossary entry for each D-score.
This is intentional. The protocol would rather under-claim about individuals than overfit to a single self-report.
How it compares to prior maturity models
This is the section most readers want, so the short answer first: prior models were designed for problems that aren't this one.
- CMMI V3.0 (2023). Excellent at process discipline. Excellent at organizational capability. Silent on the question of how individual cognitive capacity changes when humans collaborate with intelligent tooling. CMMI tells you whether your release process is repeatable. It doesn't tell you whether your senior engineer can run four agentic threads coherently and whether that capacity is being developed or eroded by the conditions around them.
- ISO/IEC 33000 series. Strong on quality measurement and assessment methodology. Same blind spot: organizational view without the new individual capacity that AI-era work requires.
- Pöppelbuß & Röglinger (2011). The methodological backbone. Their framework — design principles for maturity models, criteria for evaluation — is the foundation we used to build the Two Pillars Protocol. P&R 2011 is methodology, not a domain model. It says how to build a good maturity model; it doesn't propose one for AI-era engineering.
- Newer AI-readiness frameworks (post-2021). A dozen or so, mostly focused on org-level AI adoption strategy (do you have an AI strategy? a data platform? executive buy-in?). They miss the workshop-floor reality of how practitioners structure their work when AI is a collaborator. And they don't measure individuals.
The crosswalk: of 18 characteristics that AI-era engineering work exhibits, prior frameworks cover 4–7 each. The Two Pillars Protocol covers 17. The remaining one is an acknowledged gap that has not been operationalized yet. The full crosswalk table (characteristic by characteristic, framework by framework) will be in the v1 paper.
What the protocol knows it doesn't know yet
A maturity model that won't say what it doesn't know is selling something. The Two Pillars Protocol is open about the empirical questions it hasn't answered:
- Reliability. Cronbach's α has not yet been computed. A credible estimate needs roughly 200 respondents per functional area, and that data is still being collected.
- Predictive validity. There is no longitudinal evidence yet that the Mixer Index predicts anything — productivity, retention, hiring decisions, anything. That evidence takes 12–24 months of measurement against outcome variables.
- Cultural transferability. Items were designed in English. Whether they translate cleanly into other working languages — and whether functional-area context (Product Eng vs Platform vs ML Infra) dominates cultural context — is an open empirical question.
- Item bias. INVERSE items (deliberately reverse-coded to detect acquiescence) are flagged explicitly in the UI. Whether respondents answer them differently from direct items has not yet been measured.
- Latent construct validity. Whether "Mixer Mode" is one construct or three that happen to correlate is a factor-analysis question. Pending.
These are open empirical questions. The instrument is published now to gather the data needed to answer them.
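For reference, the reliability statistic named above is the standard Cronbach's α: for k items, α = k/(k−1) × (1 − Σ item variances / variance of the total score). The sketch below computes it with the standard library. The input layout is an assumption, inverse items are assumed to have been reverse-coded beforehand, and the sample-size threshold mentioned in the list is not enforced.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows.

    responses: list of rows, one per respondent, each a list of k item
    scores. Inverse items are assumed to be reverse-coded already.
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in responses):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total score has zero variance")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

When every item moves in lockstep across respondents, α approaches 1; when items vary independently of each other, it falls toward 0.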
How to use it
There are three paths in, sized for different purposes.
Take the diagnostic yourself. About 15 minutes. Individual mode is free, with no account and no email login required. The output is a private report with the Mixer Index, the 4 D-score perceptions, a quadrant placement, peer comparison context, and the patterns of practice the respondent has confirmed or refuted. The private URL is the credential — bookmark it to re-access.
Run a cycle in your organization. For tech leaders (EM / Director / VP / Founder) who want to assess their team or organization systematically. The cycle is scoped 1:1 with us: cohort design, hierarchy modeling, timeline, communication. The output is a k-anon protected aggregate dashboard for leadership and individual private reports for each respondent. There's a request form on the landing page; we reply within 1–2 business days. Org cycle creation is currently white-glove rather than self-serve while the cycle backend completes its v0.5 sprint.
Engage with the model. If you find a measurement issue, a missed prior framework, an item that doesn't make sense in your functional area, or a way the operationalization is wrong, we want to know. [email protected]. Dissent and refinement are part of how a maturity model earns its keep.
Where this is going
The Two Pillars Protocol is in pilot v0.4. The v0.5 sprint will complete cycle-aware reporting (preview-vs-full reports tied to cycle state) and the public release of the operationalization and scoring code. The v1 paper will write up the methodology, the crosswalk against prior models, and the initial empirical findings as the data accumulates.
The platform is operational. The model is real. The limitations are real and named.
If you got here because you're trying to figure out whether your AI-adoption framework is measuring the right thing, take the diagnostic. The 15 minutes will tell you more about your current position than most consulting reports will.
Two Pillars Protocol · pilot v0.4 · operationalization at https://assess.rlabs.cl/ · contact [email protected]