Hexagonal by rubric, not by linter: a different way to enforce architecture

Linters are binary: pass or fail. Rubrics reason. If you want to enforce architecture on LLM-generated code, you need the second one.

This is the realization that took me longest to articulate, and it's the one that ended up shaping how rlabs-agentguard evaluates output. The architectural rules that matter most in a hexagonal codebase — the ones that protect the domain, that keep the wiring clean, that make the seams maintainable — are almost all rules that a linter can't express, because they're not about the presence or absence of a token. They're about the meaning of what the code is doing.

Why Architectural Linters Break

The tools exist and they're not bad. import-linter for Python, ArchUnit for Java, dependency-cruiser for TypeScript: they all catch the obvious case. If your domain/order.py has a line that says from fastapi import APIRouter, the linter catches it. That's the easy case, and it only gets you ten percent of the way.

What the linter doesn't catch is the version that any senior reviewer would catch in two seconds: I have a domain class that internally calls a service that does touch the framework. The domain class itself has no framework imports. It passes the linter cleanly. The service it calls — also "in domain" — imports SQLAlchemy because somebody decided that was easier than going through the port. The dependency rule is violated, transitively, and the linter has no way to see the violation because the linter is reasoning at the token level.
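
To make the transitive case concrete, here is a toy sketch of the chain a reviewer traces in their head. The module names and the IMPORTS graph are hypothetical, invented for illustration; the point is only that a per-file look at domain.order finds nothing while the chain still ends at SQLAlchemy.

```python
from collections import deque

# Hypothetical import graph: module -> modules it imports directly.
IMPORTS = {
    "domain.order": ["domain.pricing_service"],
    "domain.pricing_service": ["sqlalchemy"],
    "adapters.order_repo": ["sqlalchemy", "domain.order"],
}
FORBIDDEN = {"fastapi", "sqlalchemy", "requests"}


def direct_hits(module, imports, forbidden):
    """Forbidden packages imported by this one module, looking at nothing else."""
    return [d for d in imports.get(module, []) if d.split(".")[0] in forbidden]


def forbidden_path(start, imports, forbidden):
    """Shortest import chain from start to a forbidden package, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for dep in imports.get(path[-1], []):
            if dep.split(".")[0] in forbidden:
                return path + [dep]
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return None


print(direct_hits("domain.order", IMPORTS, FORBIDDEN))
print(forbidden_path("domain.order", IMPORTS, FORBIDDEN))
```

The first call comes back empty; the second returns the chain through domain.pricing_service. What the sketch can't tell you is whether that service belonged in the domain in the first place, which is the judgment the reviewer is actually making.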

The second failure mode is worse. Every team ends up writing its own non-transferable rule set. Your team's import-linter config has eleven custom contracts, each one encoding a tacit rule that took an argument in a PR three years ago to nail down. The config is opaque to anyone new. It can't be shared with the team next door because their structure is slightly different. The rules don't move. The rules don't compose. The rules don't survive a stack change. You bought local enforcement at the cost of all generalization.

And then there's the issue that pushed me hardest into the rubric model: linters can't tell the LLM why something failed in a way the LLM can act on. A linter says "line 14 violates contract X". The model can rewrite line 14. The model can't reason about why contract X exists, what it's protecting, what an acceptable alternative would be. The corrective signal is too thin.

What an Architectural Rubric Looks Like

A rubric is a list of criteria that span the levels at which architecture actually breaks. It looks like this (excerpted from a hexagonal archetype YAML, as shipped in the api_backend built-in):

validation:
  checks:
    # Binary — same as linter, kept for completeness
    - id: no_framework_imports_in_domain
      level: error
      kind: binary
      rule: 'no module in domain/** imports from fastapi, sqlalchemy, requests'

    # Structural — needs to read the shape of the code
    - id: ports_are_abc
      level: error
      kind: structural
      rule: 'every class in domain/ports/ inherits from abc.ABC and has @abstractmethod'

    # Composition — needs to reason across files
    - id: container_is_single_wiring_point
      level: error
      kind: composition
      rule: 'no module outside container.py instantiates an adapter class directly'

    # Semantic — needs to understand what the code means
    - id: use_case_returns_result
      level: warning
      kind: semantic
      rule: 'every public method of a use case returns Result[T, E] or raises DomainError'

scoring_weights:
  type_safety: 0.9
  modularity: 0.8
  observability: 0.7

Look at the four kind values: binary, structural, composition, semantic. Only the first one is something a linter handles cleanly. The other three need an evaluator that can hold the whole file in mind, walk across module boundaries, and reason about meaning. That's what the rubric is designed to give an LLM evaluator the language to do.
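
As a rough sketch of how those checks might be held in memory and handed to an evaluator, the snippet below mirrors the YAML excerpt above as plain Python. The Check dataclass, needs_reasoning, and render_criterion are names I made up for illustration; they are not the rlabs-agentguard API, and a real loader would parse the archetype file rather than hard-code the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    id: str
    level: str  # "error" or "warning"
    kind: str   # "binary", "structural", "composition", "semantic"
    rule: str


# Hand-copied from the YAML excerpt; assumed here instead of parsed.
CHECKS = [
    Check("no_framework_imports_in_domain", "error", "binary",
          "no module in domain/** imports from fastapi, sqlalchemy, requests"),
    Check("ports_are_abc", "error", "structural",
          "every class in domain/ports/ inherits from abc.ABC and has @abstractmethod"),
    Check("container_is_single_wiring_point", "error", "composition",
          "no module outside container.py instantiates an adapter class directly"),
    Check("use_case_returns_result", "warning", "semantic",
          "every public method of a use case returns Result[T, E] or raises DomainError"),
]


def needs_reasoning(check):
    """Everything except the binary kind goes to the reasoning evaluator."""
    return check.kind != "binary"


def render_criterion(check):
    """One line of evaluator prompt per criterion."""
    return f"[{check.level.upper()}] {check.id} ({check.kind}): {check.rule}"


for check in CHECKS:
    if needs_reasoning(check):
        print(render_criterion(check))
```

Keeping kind on each criterion is what lets the binary checks go to a cheap deterministic pass while the other three are routed to the evaluator.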

Who Evaluates the Rubric

The evaluator can be the same LLM that generated the code, running an auto-challenge pass: a second call where the model is given its own output plus the rubric and asked, criterion by criterion, whether the output satisfies the criterion and why. The auto-challenge catches a surprising amount, because the model is good at evaluating against an explicit standard even when it failed to produce against it on the first pass. The asymmetry between generation and verification is real and worth exploiting.
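
A minimal sketch of that second pass, under stated assumptions: call_model is a placeholder for whatever client you use (any function from prompt string to reply string), and the PASS/FAIL-on-the-first-line reply format is a convention I'm assuming for the example, not something the tool prescribes.

```python
from typing import Callable


def build_challenge_prompt(code: str, check_id: str, rule: str) -> str:
    """Ask about exactly one criterion so the verdict stays attributable."""
    return (
        "Evaluate the code against one criterion.\n"
        f"Criterion {check_id}: {rule}\n"
        "Answer PASS or FAIL on the first line, then explain why.\n\n"
        f"{code}"
    )


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Split the reply into (passed, reasoning)."""
    first, _, rest = reply.strip().partition("\n")
    verdict = first.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # An unparseable reply counts as a failure, never as a silent pass.
        return False, f"unparseable evaluator reply: {reply!r}"
    return verdict == "PASS", rest.strip()


def auto_challenge(code: str, checks: dict[str, str],
                   call_model: Callable[[str], str]) -> dict[str, tuple[bool, str]]:
    """Run one evaluator call per criterion and collect the verdicts."""
    return {
        check_id: parse_verdict(call_model(build_challenge_prompt(code, check_id, rule)))
        for check_id, rule in checks.items()
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "use_case_returns_result" in prompt:
        return "FAIL\nThe use case returns a raw dict instead of a Result."
    return "PASS\nNo violation found."


results = auto_challenge(
    "def create_note(payload): return {'id': 1}",
    {
        "ports_are_abc": "every class in domain/ports/ inherits from abc.ABC",
        "use_case_returns_result": "every public method of a use case returns Result[T, E]",
    },
    fake_model,
)
print(results)
```

One call per criterion costs more than a single batched prompt, but it keeps each verdict tied to a named check, which is the property the rest of this approach relies on.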

For higher-stakes evaluations — and this is the architecture I've been moving toward, though I'd call it post-MVP — you can run cross-model validation: the code is generated by model A, the rubric is evaluated by model B. The two models are less likely to share the same blind spots. When they agree the rubric is satisfied, the signal is stronger than either alone. When they disagree, you have a specific, named disagreement to escalate to a human — not a vague "the model wasn't sure".
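
The escalation step can be sketched in a few lines. This assumes each evaluation has already been reduced to a mapping from check id to pass/fail, one from the generating model's own challenge pass and one from the second model; compare_verdicts is a hypothetical helper, and since this part is post-MVP, treat it as a direction rather than a description of shipped behavior.

```python
def compare_verdicts(verdicts_a, verdicts_b):
    """Split checks into those both evaluators agree on and those to escalate.

    A check reported by only one evaluator is treated as a disagreement.
    """
    agreed = {}
    disputed = []
    for check_id in sorted(set(verdicts_a) | set(verdicts_b)):
        a = verdicts_a.get(check_id)
        b = verdicts_b.get(check_id)
        if a is not None and a == b:
            agreed[check_id] = a
        else:
            disputed.append(check_id)
    return agreed, disputed


model_a = {"ports_are_abc": True, "container_is_single_wiring_point": True}
model_b = {"ports_are_abc": True, "container_is_single_wiring_point": False}

agreed, disputed = compare_verdicts(model_a, model_b)
print("agreed:", agreed)
print("escalate to a human:", disputed)
```

The disputed list is the named disagreement: a reviewer gets a check id and two conflicting readings instead of a general request to look again.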

The key design decision: the rubric travels with the archetype. It doesn't live in CI. This matters because CI rules are project-local. They get added to the project's Jenkinsfile or its GitHub Actions config, they get tweaked over time, and they don't move when the project moves. The rubric, by living in the archetype, is portable across every project that uses that archetype. Demo #07 in the public agentguard-demo repo (gitea.rlabs.cl/rlabs-cl/agentguard-demo) walks through this on a FastAPI note-api built against the api_backend archetype — the same archetype that any other api_backend project would use, with the same rubric, regardless of which CI system the project happens to run.

Honest Limitations

I don't want to oversell this, because the failure modes are real and they matter for production use.

The evaluator LLM hallucinates less than when generating, but it still hallucinates. It can claim that domain/order.py doesn't import FastAPI when it does, especially if the import is buried in a nested conditional or behind a star-import. It can give a confident reading of a container.py and miss that an adapter is being instantiated indirectly through a factory. The hallucination rate on evaluation is significantly lower than on generation — verification against an explicit standard is a much narrower task than open-ended production — but it isn't zero. The rubric output has to be read with the same skepticism you'd bring to any model output on a non-trivial task.
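
One cheap way to apply that skepticism is to re-verify the binary claims deterministically instead of trusting the evaluator's word. The sketch below uses Python's standard ast module to list forbidden imports wherever they sit in a file, including inside nested conditionals. The forbidden_imports helper and the sample source are illustrative only. It sees a direct from sqlalchemy.orm import *, but it does not follow names re-exported through a star-import of some other module, so that blind spot remains.

```python
import ast

FORBIDDEN = {"fastapi", "sqlalchemy", "requests"}


def forbidden_imports(source, forbidden=FORBIDDEN):
    """Return sorted (line, module) pairs for every forbidden absolute import."""
    found = []
    # ast.walk visits every node, so imports nested in functions or ifs are included.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in forbidden:
                found.append((node.lineno, name))
    return sorted(found)


SAMPLE = '''\
import os

def handler(debug):
    if debug:
        if os.environ.get("TRACE"):
            from fastapi import APIRouter  # buried in a nested conditional
    return None

from sqlalchemy.orm import *
'''

print(forbidden_imports(SAMPLE))
```

If the evaluator reports the binary check as passing and this list is non-empty, the deterministic result wins and the discrepancy is worth logging.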

Rubrics don't replace tests. This one matters and I want to be explicit. A rubric checks that the code has the right shape. A test checks that the code does the right thing. The shape can be perfect and the behavior can still be wrong, and a clean rubric pass on a function that returns the wrong answer is worse than no check at all because it produces false confidence. The rubric is a complement to the test suite, not a substitute for it. Anyone telling you otherwise — including me, if I get sloppy — is selling something.

You still need a human for the edge cases. The rubric catches the structural patterns. It doesn't catch the situation where the right answer is "we should violate this rule for this specific reason". Architecture has exceptions. Senior engineers know which exceptions are legitimate and which are shortcuts. The rubric doesn't know that, and asking the LLM to evaluate "is this a legitimate exception" is asking for trouble. The rubric should flag the exception. The human should decide.

The Transferable Principle

The principle generalizes well past rlabs-agentguard, and I'd argue it applies whether or not you ever use any LLM tooling at all.

If your architectural rules live only in the seniors' heads, write them as a rubric. The exercise alone earns its keep. Sit with the two or three senior engineers on your team. Ask them to list, in order, the architectural rules they enforce in code review without thinking about it. Capture them. Categorize them by kind: binary, structural, composition, semantic. Most teams discover, in the process, that half the rules they thought everyone knew are actually held by one person, and that two of the rules different seniors hold are quietly in conflict. That conversation is more valuable than the rubric itself.

Rubrics generalize across projects where linters fragment. A linter config is project-local. A rubric, declared at the archetype level, applies to every project that conforms to that archetype. When you move from project A to project B, the rules don't have to be reinvented. The portability is what makes the rubric useful at organization scale, not just at team scale.

And the one that took me longest to see: the rubric is the language an agent can operate in; the linter isn't. A linter is a closed system — its output is pass/fail, no reasoning attached, no corrective signal beyond "fix the violation". A rubric is open — each criterion carries its rule, its rationale, its level (error/warning), and the evaluator can produce a structured report that says "check X failed because the use case returns a raw dict instead of a Result", which the generating agent can act on directly. The rubric is communicative. The linter is binary. For coordinating with an agent, you need the first one.
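
To show what a report the agent can act on might look like, here is a small sketch. The Finding shape and both helper names are assumptions made for this example, not the actual report format of rlabs-agentguard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check_id: str
    level: str    # "error" or "warning"
    passed: bool
    reason: str


def feedback_for_agent(findings):
    """Render only the failures, each with the reason the evaluator gave."""
    return "\n".join(
        f"{f.level.upper()} {f.check_id}: {f.reason}"
        for f in findings
        if not f.passed
    )


def blocks_merge(findings):
    """Failed errors block; failed warnings are reported but do not block."""
    return any(f.level == "error" and not f.passed for f in findings)


report = [
    Finding("ports_are_abc", "error", True, "all ports inherit from abc.ABC"),
    Finding("use_case_returns_result", "warning", False,
            "the use case returns a raw dict instead of a Result"),
]

print(feedback_for_agent(report))
print("blocks merge:", blocks_merge(report))
```

The string from feedback_for_agent can be appended to the next generation prompt as-is, which is the corrective signal a bare pass/fail never carries.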

Question for the comments: which architectural rule in your team is impossible to encode as a linter but everyone knows when it breaks?

#Architecture #SoftwareCraftsmanship #Engineering #CodeQuality #GenAI