MeaningStack · Comparative Agent Evaluation

→ The problem

How enterprises evaluate AI agents today
— and why it fails institutionally.

The current methods for selecting an AI agent for production decisions all share one structural flaw: they cannot answer the question that matters to the institution. The institutional question is not "which agent performs best on a published benchmark", but "which agent operates admissibly when handling our actual decisions, against our governance envelope, with our regulatory exposure." Three failure modes recur.

→ Failure mode 01

Vendor-published benchmarks

MMLU, HumanEval, GPQA, SWE-Bench: the scores are selected and reported by parties whose incentives run in a single direction. The benchmark itself often becomes a training signal. Even when honestly run, these benchmarks measure capability in the abstract, not admissibility under specific institutional conditions.

→ Failure mode 02

Vendor-calibrated demos

The candidate is shown solving curated tasks under conditions the vendor selected. The enterprise sees what the vendor wants it to see. No exposure to the messy edge cases, the institutional ambiguity, or the load conditions of real production use.

→ Failure mode 03

Internal pilots without governance

The enterprise runs candidates against real decisions but lacks a verification framework. Adoption decisions get made on velocity metrics and gut feel from technical staff. The risk committee, regulator, and board receive no defensible evidence of why this agent was selected over alternatives.

None of the three methods produces evidence the enterprise can defend institutionally. The adoption decision happens — sometimes correctly, sometimes not — but the institutional record of how the decision was made is absent or weak. When the question returns later — from a regulator, an auditor, a board, or a counterparty — the enterprise has nothing to point to.

→ The architecture

One Blueprint. Many candidate agents.
One ledger of verifications.

Comparative agent evaluation is the same Steward architecture deployed for a different question. The Blueprint is authored once — codifying the institutional conditions for the decision class under evaluation. Each candidate agent then operates against the same decisions, observed by Steward, scored across the same five dimensions, accumulating trust trajectory under identical conditions. The ledger that records each verification becomes the comparative evidence base.

Author the Blueprint

Together with the enterprise, MeaningStack authors or adapts the Blueprint that codifies the institutional conditions for the decision class under evaluation. Categorical prohibitions (tripwires). Graduated quality dimensions (checks). Calibrated thresholds. Trust accumulation policy. The Blueprint is the institutional declaration of what admissibility means — and it is what every candidate agent will be measured against.

The Blueprint authored for evaluation is the same Blueprint that will govern production. Evaluation does not produce throwaway artifacts.

→ Output: Versioned Blueprint artifact, owned by the enterprise

→ Reusability: Direct path to production deployment after selection

→ Customization: Starts from a template if available, bespoke otherwise
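
To make the parts of a Blueprint concrete, here is a minimal sketch in Python. It is an illustration only: no Blueprint schema is published here, so every name (`Blueprint`, `Tripwire`, `verify`, the dimension keys) is hypothetical, and the per-dimension scores are assumed to arrive from a separate scoring step. The trust accumulation policy is left out of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical dimension keys, named after the five judgment dimensions.
DIMENSIONS = (
    "reasoning_quality",
    "knowledge_grounding",
    "context_fit",
    "tool_safety",
    "ethical_alignment",
)


@dataclass(frozen=True)
class Tripwire:
    """A categorical prohibition: any match makes the decision inadmissible."""
    name: str
    violated: Callable[[dict], bool]


@dataclass(frozen=True)
class Blueprint:
    """Assumed shape of a versioned Blueprint (illustrative, not a real schema)."""
    version: str
    tripwires: List[Tripwire]
    thresholds: Dict[str, float]  # minimum acceptable score per dimension

    def verify(self, decision: dict, scores: Dict[str, float]) -> dict:
        # Tripwires are checked categorically; checks are graded against thresholds.
        tripped = [t.name for t in self.tripwires if t.violated(decision)]
        below = [d for d in DIMENSIONS if scores[d] < self.thresholds[d]]
        return {
            "blueprint_version": self.version,
            "tripwires": tripped,
            "below_threshold": below,
            "admissible": not tripped and not below,
        }


# Example values are invented for illustration.
blueprint = Blueprint(
    version="claims-v1",
    tripwires=[Tripwire("payout_over_limit", lambda d: d["payout"] > 10_000)],
    thresholds={d: 0.7 for d in DIMENSIONS},
)
result = blueprint.verify({"payout": 2_500}, {d: 0.8 for d in DIMENSIONS})
```

Because the same `Blueprint` object would be applied to every candidate, the version string travels with each verification result, which is what lets evaluation evidence carry over to production unchanged.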

Deploy candidates against the same decisions

Each candidate agent is configured to operate on the same set of decisions — typically a curated cohort sampled from production data or a shadow stream running against live operations. Claude. GPT. Gemini. Open-source models. Internally fine-tuned systems. Any agent the enterprise is considering. Steward observes each candidate's reasoning, the inputs it cites, the tool calls it makes, and the decision it commits.

The candidates do not interact. Each operates independently, observed under identical verification conditions. No agent receives privileged inputs, calibrated prompts, or vendor-tuned configurations. The conditions are the enterprise's, not the vendor's.

→ Typical cohort size: 200 to 2,000 decisions · depends on decision class variance

→ Data source: Historical decisions, shadow stream of live operations, or curated test set

→ Candidate parallelism: Typically 2 to 5 candidate agents in one evaluation

→ Isolation: Each candidate operates blind to the others' outputs
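
The isolation requirement can be sketched as a small harness. This is a simplified illustration under stated assumptions: an agent is modeled as a plain Python callable (`Agent` is a hypothetical type), whereas a real deployment would wrap each vendor's or in-house system behind its own adapter.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical interface: an agent takes one decision case and returns the
# decision it commits.
Agent = Callable[[dict], dict]


def run_cohort(candidates: Dict[str, Agent], cohort: List[dict]) -> Dict[str, List[dict]]:
    """Run every candidate on the same cohort, each blind to the others."""
    results: Dict[str, List[dict]] = {}
    for name, agent in candidates.items():
        # Each agent receives its own deep copy of each case, so nothing one
        # candidate does to its input can reach another candidate.
        results[name] = [agent(copy.deepcopy(case)) for case in cohort]
    return results
```

Every candidate sees the identical cohort in the identical order, and no candidate's output is ever passed to another.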

Score across identical dimensions

Every decision each candidate makes is verified against the same Blueprint. The five dimensions of judgment are scored consistently across candidates — reasoning quality, knowledge grounding, context fit, tool safety, ethical alignment. The ledger persists tamper-evident records for every verification, indexed by candidate.

Beyond per-decision scoring, the trust policy applies identically: every intervention adds measured trust debt for that agent, the debt decays with sustained correct behavior, and threshold crossings are surfaced explicitly. The candidates' trajectories become comparable as continuous variables, not just as point scores.

→ Scored dimensions: Reasoning quality · Knowledge grounding · Context fit · Tool safety · Ethical alignment

→ Granularity: Per-decision verification + agent-level trust trajectory

→ Comparability: Identical Blueprint, identical thresholds, identical decision cohort
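
One way to picture the trust policy is as a running debt that rises on each intervention and decays on each clean decision. The sketch below assumes a multiplicative decay and a single restricted-mode threshold. Both choices, along with every parameter value and name, are illustrative assumptions rather than Steward's actual policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrustPolicy:
    # All values are illustrative assumptions, not calibrated defaults.
    debt_per_intervention: float = 1.0
    decay_factor: float = 0.9          # applied after each clean decision
    restricted_threshold: float = 3.0  # debt level that triggers restricted mode


def trust_trajectory(interventions: List[bool], policy: TrustPolicy) -> List[float]:
    """Return the trust debt after each decision in the cohort."""
    debt = 0.0
    trajectory = []
    for intervened in interventions:
        if intervened:
            debt += policy.debt_per_intervention
        else:
            debt *= policy.decay_factor
        trajectory.append(debt)
    return trajectory


def threshold_crossings(trajectory: List[float], policy: TrustPolicy) -> List[int]:
    """Indices where the debt rises from below the threshold to at or above it."""
    crossings = []
    previous = 0.0
    for index, debt in enumerate(trajectory):
        if previous < policy.restricted_threshold <= debt:
            crossings.append(index)
        previous = debt
    return crossings
```

Running the same policy object over each candidate's intervention history yields one trajectory per candidate, which can then be compared point by point instead of as a single score.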

Read the divergence

The enterprise — not the model vendor, not MeaningStack — interprets the ledger. Which candidate grounds its reasoning in admissible sources most consistently. Which respects the institutional envelope on edge cases. Which accumulates trust debt fastest under realistic load. Which surfaces graduated signals reliably enough to enable human escalation.

The decision belongs to the enterprise, and so does the defense of that decision. The evidence is in the ledger; the ledger is owned by the enterprise; the institutional reasoning is auditable. When the risk committee asks why this agent was selected, the answer is in the verification record.

→ Decision authority: Enterprise. MeaningStack provides infrastructure, not recommendations

→ Evidence form: Tamper-evident ledger entries, exportable for institutional review

→ Continuation: Selected agent continues under Steward verification in production
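
How the ledger achieves tamper evidence is not specified here. A common technique is hash chaining, where each entry stores a digest of its own record together with the previous entry's hash. The sketch below assumes that technique. The function names and the entry layout are hypothetical.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always produces the same digest.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(ledger: List[dict], record: dict) -> None:
    """Append a record, chaining it to the hash of the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify_ledger(ledger: List[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A reviewer who receives an exported ledger can rerun `verify_ledger` independently. A production system would typically also sign or externally anchor the latest hash, which this sketch omits.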

→ The neutrality moat

MeaningStack does not sell models. The neutral evaluator is a structural position, not a marketing claim.

Anthropic cannot honestly evaluate whether Claude meets your conditions better than GPT does — its incentive runs in one direction. The same is true for OpenAI, for Google, for Microsoft, for every model vendor. Even when they are committed to operating ethically, the structural conflict of interest is real and visible to a risk committee.

MeaningStack does not sell models, has no investment exposure to any model vendor, and has no roadmap to enter that category. The neutrality is structural — not a promise that could later be revised, but a position that the company's business model makes durable. When the evidence is produced by infrastructure your enterprise owns and operates, and the operator does not benefit from one outcome over another, the institutional defense becomes simple.

→ What the enterprise receives

Evidence the enterprise owns,
produced on its own decisions.

The outputs of comparative evaluation are not opinion, not vendor analytics, not consultant deliverables. They are verification records — produced by infrastructure the enterprise operates, on decisions the enterprise selected, against conditions the enterprise declared. Four artifacts, all owned and exportable.

→ Artifact 01

Per-agent verification record

Complete tamper-evident ledger entries for every decision each candidate processed. Blueprint version, inputs cited, reasoning observed, signal emitted, escalation path taken. One row per decision, one column per candidate. Exportable in formats suitable for regulatory submission or audit committee review.
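
The "one row per decision, one column per candidate" layout can be sketched as a CSV export. The input shape and the function name `export_matrix` are assumptions for illustration. The actual export formats are not specified here.

```python
import csv
import io
from typing import Dict, List


def export_matrix(signals: Dict[str, Dict[str, str]], candidates: List[str]) -> str:
    """Build CSV text from a mapping of decision_id -> {candidate: signal}."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["decision_id", *candidates])
    for decision_id in sorted(signals):
        row = signals[decision_id]
        # A candidate with no record for this decision gets an empty cell.
        writer.writerow([decision_id, *(row.get(c, "") for c in candidates)])
    return buffer.getvalue()
```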

→ Artifact 02

Dimensional comparison

Score distributions across the five judgment dimensions, per candidate. Where one candidate excels at reasoning quality but underperforms on knowledge grounding. Where another shows volatile tool safety. The pattern of strengths and weaknesses, not a single composite number.
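
A dimensional comparison of this kind can be sketched by summarizing each candidate's scores per dimension, keeping both the central value and the spread so that volatility stays visible. The input shape and the function name are assumptions. The mean and population standard deviation stand in for whatever distribution summary an actual evaluation would report.

```python
from statistics import mean, pstdev
from typing import Dict, List


def dimensional_summary(
    scores: Dict[str, List[Dict[str, float]]]
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Summarize per-decision scores as candidate -> dimension -> mean and spread.

    Assumes every candidate has at least one scored decision and that every
    decision carries the same dimension keys.
    """
    summary: Dict[str, Dict[str, Dict[str, float]]] = {}
    for candidate, per_decision in scores.items():
        dimensions = list(per_decision[0].keys())
        summary[candidate] = {}
        for dim in dimensions:
            values = [decision[dim] for decision in per_decision]
            summary[candidate][dim] = {"mean": mean(values), "spread": pstdev(values)}
    return summary
```

No composite score is computed on purpose: the output keeps one entry per dimension so that the pattern of strengths and weaknesses is what gets compared.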

→ Artifact 03

Trust trajectory per candidate

How each agent's trust debt accumulated and decayed across the evaluation period. Which candidates approached restricted mode under realistic load. Which decayed back to baseline quickly versus slowly. An empirical measure of operational stability under your conditions.

→ Artifact 04

Defensible adoption record

A consolidated institutional record that the enterprise can present to its risk committee, regulator, board, or counterparty. The decision and the evidence behind it, in a form designed for review by parties who were not in the room when the evaluation ran. Adoption defensibility, not just adoption.

Critically, none of these artifacts are produced or interpreted by the model vendor. The Blueprint was authored with the enterprise. The verifications were observed by Steward. The ledger is owned by the enterprise and lives in infrastructure under enterprise control. When the decision is questioned later, the evidence is present, intact, and institutionally controlled — not dependent on any vendor's continued cooperation.

→ Engagement

How an evaluation engagement runs.
3 to 8 weeks, depending on decision class.

Comparative agent evaluation is structured as a focused engagement. Not a multi-quarter procurement. Not a consulting deliverable. A working engagement that produces verification infrastructure the enterprise owns at the end — whether the decision is to adopt one of the candidates, to wait, or to evaluate a new round.

→ Phase 01

Decision class scoping

Week 1

Working session on the decision class under evaluation. Which conditions are categorical. Which dimensions matter institutionally. Which candidates are in scope. What success looks like at the end.

→ Phase 02

Blueprint authoring

Weeks 2-3

Co-design of the Blueprint. Tripwires identified. Checks specified across the five dimensions. Thresholds calibrated. Trust policy parameterized to the decision class risk. Versioned and signed off by the enterprise.

→ Phase 03

Candidate deployment

Weeks 3-6

Each candidate agent configured against the decision cohort under Steward observation. Verifications accumulate in the ledger per candidate. Issues surfaced; calibrations refined if necessary.

→ Phase 04

Evidence review

Weeks 6-8

Synthesis session. Ledger explored together with the enterprise. Comparison artifacts prepared in forms suitable for institutional review. Selection authority remains with the enterprise.

The variance comes from decision class complexity, not from the technology. An evaluation of code-change agents typically completes in three to four weeks. An evaluation of claims-adjudication agents in a regulated domain may require six to eight, because the Blueprint authoring phase carries more institutional weight. Evaluation is not a sprint and not a slog. It is a working engagement with a defined deliverable.

Considering an agent adoption decision?

Bring the decision class. We bring the verification infrastructure. The evidence belongs to your enterprise at the end — whichever candidate is selected.

→ Begin a conversation

Comparative agent evaluation on your conditions, on your decisions.