→ Steward · by MeaningStack · In production · design partner

Verification at
the decision boundary.

Steward observes your AI agent's rationale at decision time and verifies it against enterprise-owned specifications.
Observability tells you what your AI did. Specification tells you what it was supposed to do. Steward verifies the gap — in production, against the registries and policies your enterprise already runs on.
SDK shipping · First design partner in production · Patent pending
→ Verification flow · per decision · Live
→ 01
Agent decides · proposes adjudication, exposes rationale
Decide
→ 02
Steward observes · captures rationale, queries registries
Observe
→ 03
Specification verifies · authority envelope, evidence, triggers
Verify
→ 04
Signal emitted · graduated, governed by enterprise policy
Signal
→ 05
Ledger writes · tamper-evident record of inputs, rationale, signal
Ledger
→ The verification gap

Your AI agents are
already adjudicating.
What governs them is what is missing.

Enterprise AI is no longer a research project. Agents adjudicate claims, classify shipments, route work, draft policy responses, write code. The decisions are real. The institutional accountability around them has not caught up.

The existing answer — observability — records what happened. It does not verify whether what happened was supposed to happen. It does not check the agent's reasoning against the authority envelope your enterprise has already delegated to humans. It does not confirm the inputs cited were admissible. It does not detect when an agent crossed a boundary nobody noticed yet.

Steward fills that gap. It observes the agent's rationale at decision time and verifies it against the specifications and registries your enterprise already owns — your policy systems, approved-source lists, authority schedules, regulatory expectations. Verification signals are graduated. Intervention is governed by enterprise policy. Your core systems remain untouched.

→ Observability records
What the agent did · what tokens it consumed · what tools it called · what output it produced
→ Steward verifies
Whether what the agent did was supposed to be done · against authority envelope, admissible evidence, escalation triggers
→ The two are complementary
Steward runs alongside your existing observability stack. Datadog records behavior. Steward verifies admissibility.
→ The architecture

One pattern.
Three components.
Configurable per decision class.

Steward is the runtime layer, but it operates inside a three-component pattern. Blueprints define bounded operational participation for delegated systems. Steward verifies runtime adherence against them. The ledger records every verification. Each component is owned by the enterprise. Each is exportable. Each integrates with the systems you already run.

→ Today · scattered

What governs the agent is scattered

across documents, runbooks, prompts, code
  • Policy documents · in PDF, in Confluence, in email threads
  • Approved-source registries · in separate systems
  • Authority schedules · in HR or legal repositories
  • Escalation rules · in runbooks, in tribal knowledge
  • Regulatory expectations · in published bulletins, partially internalised
→ Specification · Blueprint

Decision contract

schema-based · policy-aware · machine-enforceable
  • Normalised · canonical fields per decision class
  • Owned · a single artifact, versioned by the enterprise
  • Consumed by Steward · not by your agents
  • Human-reviewable · readable YAML on the human-facing side
  • Portable · exportable to regulators, reinsurers, auditors
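To make the idea of a decision contract concrete, here is a minimal sketch of how a Blueprint with canonical fields might be modelled. Every field name (`authority_limit`, `admissible_sources`, `escalation_triggers`) is a hypothetical stand-in rather than the MeaningStack schema, and JSON is used for export only because it ships with Python's standard library.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class Blueprint:
    """Hypothetical decision contract; field names are illustrative only."""
    decision_class: str
    version: str
    authority_limit: float       # assumed: largest amount the agent may decide alone
    admissible_sources: tuple    # assumed: registry identifiers the agent may cite
    escalation_triggers: tuple   # assumed: named conditions that re-invite a human

    def export(self) -> str:
        """Serialise to a stable, human-reviewable JSON document."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

blueprint = Blueprint(
    decision_class="export_classification",
    version="v3.2",
    authority_limit=10_000.0,
    admissible_sources=("sanctions_registry", "approved_vendor_list"),
    escalation_triggers=("dual_use_item",),
)
exported = blueprint.export()
```

Freezing the dataclass mirrors the "single artifact, versioned" property: a change means issuing a new version, not mutating the one in use.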
→ Steward

Runtime verification

observes rationale · verifies against Blueprint · emits signals
  • Boundary check · is this inside delegated authority?
  • Evidence check · was admissible data used?
  • Trigger check · when must a human be re-invited?
  • Verification signals · graduated, not binary
  • Steward proposes; the enterprise authorises
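The three checks listed above can be sketched in a few lines. This is an illustration under stated assumptions, not Steward's implementation: the `Proposal` shape, the `verify` function, and the idea of expressing authority as a single numeric limit are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """Hypothetical shape of what an agent exposes at decision time."""
    amount: float
    sources_cited: list
    conditions: set

def verify(proposal, authority_limit, admissible_sources, escalation_triggers):
    """Run the three checks and return a dict of findings (illustrative only)."""
    inadmissible = [s for s in proposal.sources_cited if s not in admissible_sources]
    fired = sorted(proposal.conditions & escalation_triggers)
    return {
        "within_authority": proposal.amount <= authority_limit,  # boundary check
        "inadmissible_sources": inadmissible,                     # evidence check
        "triggers_fired": fired,                                  # trigger check
    }

findings = verify(
    Proposal(amount=12_500.0,
             sources_cited=["policy_admin", "web_search"],
             conditions={"litigation_flag"}),
    authority_limit=10_000.0,
    admissible_sources={"policy_admin", "sanctions_registry"},
    escalation_triggers={"litigation_flag", "fraud_indicator"},
)
```

The function returns findings rather than a verdict, which leaves the choice of response to enterprise policy.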
→ Source registries
Blueprints integrate with your existing registries · approved-vendor lists · sanctions providers · policy admin systems · authority schedules · panel attorneys. Steward queries them at decision time. Registries remain yours.
→ Ledger
Every verification produces a tamper-evident record · the Blueprint version, the inputs cited, the verification signals, the escalation path taken. Exportable to internal audit, regulator inquiry, and external review.
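One common way to make a record tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below shows that general technique; it is an assumption for illustration, not a description of how Steward's ledger is built, and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def append_entry(ledger, record):
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    ledger.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})

def chain_is_intact(ledger):
    """Recompute every hash; an edited record or broken link returns False."""
    prev_hash = GENESIS
    for entry in ledger:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
append_entry(ledger, {"blueprint_version": "v3.2", "signal": "flag",
                      "inputs": ["policy_admin"]})
append_entry(ledger, {"blueprint_version": "v3.2", "signal": "escalate",
                      "inputs": ["sanctions_registry"]})
```

Editing any stored record after the fact changes its recomputed hash, so `chain_is_intact` fails from that entry onward.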
→ Adoption promise
Your core systems stay untouched. Steward observes the agent's rationale at the SDK boundary and applies verification per specification. No modification to your claims platform, your policy admin, your code repositories. Steward sits adjacent, not in line.
→ How it runs

Three deployment shapes.
Five graduated interventions.
Coverage configured per decision class.

Steward is not a single mode of operation. Different decision classes need different verification postures — and the same enterprise will run several at once. The architecture lets risk and operations teams grade verification to the criticality of each decision.

→ Runtime loop · per decision
→ 01 · DECIDE
Agent decides
Your agent reaches a proposed decision and exposes its rationale.
→ 02 · OBSERVE
Steward observes
Captures rationale and inputs · queries the relevant registries.
→ 03 · VERIFY
Specification verifies
Authority envelope, evidence admissibility, and escalation triggers checked.
→ 04 · SIGNAL
Signal emitted
A graduated verification signal returned · enterprise policy decides response.
→ 05 · RECORD
Ledger writes
Inputs, rationale, signal, specification version recorded · tamper-evident.
→ Shape 01 · Observe

Verification only

Steward verifies and records · the agent continues. Used for first deployment, low-risk decision classes, and model validation periods. No intervention — full evidentiary trail.
→ Used for: First deployment · routine workflows · model validation
→ Shape 02 · Enforce

Graduated intervention

Five interventions available · activated per decision class, per specification. Intervention itself is governed by enterprise policy. Steward proposes the signal; the enterprise authorises what happens next.
→ Used for: High-criticality decisions · regulated workflows · bordereau-grade evidence
→ Shape 03 · Direct

Agent self-consumes specification

For low-risk self-contained decisions, the agent consumes the specification directly. Classification matters — high-stakes decisions stay on Steward. The architecture lets the enterprise choose.
→ Used for: Self-contained classifications · low-stakes routing · internal-only flows
→ Five graduated interventions · used under Enforce
Flag → Soft
Nudge → Soft
Rollback → Medium
Block → Hard
Escalate → Human

Intervention is graduated, not binary. An agent rationale that is borderline gets a flag in the ledger. One that crosses authority gets blocked. One that requires human judgment gets escalated. Steward proposes the signal; the enterprise authorises the response.
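"Steward proposes; the enterprise authorises" can be pictured as a policy table that caps the proposed signal per decision class. The sketch below is one possible reading, with loud assumptions: the severity ordering, the `AUTHORISED_CEILING` table, and the capping rule are all hypothetical.

```python
from enum import IntEnum

class Signal(IntEnum):
    """The five interventions named above; the numeric ordering is assumed."""
    FLAG = 1
    NUDGE = 2
    ROLLBACK = 3
    BLOCK = 4
    ESCALATE = 5

# Hypothetical enterprise policy: the strongest intervention each
# decision class has authorised a proposed signal to trigger.
AUTHORISED_CEILING = {
    "routine_routing": Signal.NUDGE,
    "claims_adjudication": Signal.ESCALATE,
}

def authorised_response(decision_class, proposed):
    """Cap the proposed signal at what policy allows for this class.

    Classes with no policy entry fall back to FLAG (record only).
    """
    ceiling = AUTHORISED_CEILING.get(decision_class, Signal.FLAG)
    return min(proposed, ceiling)
```

For example, a proposed `Signal.BLOCK` on `routine_routing` is reduced to `Signal.NUDGE`, while the same proposal on `claims_adjudication` passes through unchanged.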

→ The category distinction

Not a rules engine.
A different unit of analysis.

Rules engines like Open Policy Agent, HashiCorp Sentinel, and AWS Cedar evaluate boolean conditions on structured data — and they do that well. Steward is built for a different question. Where rules engines ask "is this action permitted?", Steward asks "is this agentic participation operationally admissible under the conditions the enterprise declared?" The difference is not implementation. It is the unit of analysis.

→ Rules engines

Boolean enforcement

Built for: user actions, permission checks, deterministic policy
  • Unit of analysis · A discrete action on structured data
  • Output · Allow or Deny
  • Evaluation · Boolean conditions, microsecond latency
  • Memory · Stateless — each evaluation independent
  • Severity · Flat — every violation is a deny
  • Best for · IAM, network policies, deterministic gates
→ Steward

Agentic verification

Built for: AI agents that reason, cite evidence, exercise judgment
  • Unit of analysis · An agent's reasoning and proposed action
  • Output · Flag · Nudge · Rollback · Block · Escalate
  • Evaluation · Multi-dimensional scoring with calibrated thresholds
  • Memory · Stateful — agent accumulates trust debt with decay over time
  • Severity · Graded by decision class risk and accumulated trajectory
  • Best for · AI decision boundaries, regulated agentic workflows
→ What rules engines categorically cannot evaluate
Reasoning quality
Whether the agent's rationale reflects real understanding of context, not surface plausibility
Knowledge grounding
Whether the inputs cited come from admissible sources at adequate citation density
Context fit
Whether the action respects the operational and institutional context surrounding the decision
Tool safety
Whether the operational safeguards are in place — blast radius, rollback availability, rate limits
Ethical alignment
Whether the action introduces privacy regression, security regression, or unannounced user impact
→ The architecture difference that compounds
Beyond per-decision verification, Steward maintains agent trust trajectory — every intervention accumulates trust debt with configurable decay, and threshold crossings move the agent into elevated monitoring, restricted mode, or institutional re-evaluation. This is stateful institutional memory at the agent level — a capability rules engines categorically do not have, because their unit of analysis is the action, not the participant.
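A trust trajectory of this kind can be sketched as a running debt that decays between interventions and is compared against thresholds. The half-life, the debt amounts, and the threshold values below are invented for illustration; only the three state names come from the text above.

```python
HALF_LIFE_DAYS = 30.0  # assumed decay rate: debt halves every 30 days

# Assumed thresholds, checked from most to least severe.
THRESHOLDS = [
    (6.0, "institutional re-evaluation"),
    (3.0, "restricted mode"),
    (1.5, "elevated monitoring"),
]

def decayed(debt, days_elapsed):
    """Exponential decay of accumulated debt over elapsed days."""
    return debt * 0.5 ** (days_elapsed / HALF_LIFE_DAYS)

def state_for(debt):
    """Map a debt level to the first threshold it meets or exceeds."""
    for threshold, state in THRESHOLDS:
        if debt >= threshold:
            return state
    return "normal"

def replay(events):
    """events: (day, debt_added) pairs sorted by day. Returns (debt, state)."""
    debt, last_day = 0.0, 0.0
    for day, added in events:
        debt = decayed(debt, day - last_day) + added
        last_day = day
    return debt, state_for(debt)
```

Because the state depends on the history of interventions and not on any single decision, this is the part a stateless evaluator has no place to store.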
→ Complementary, not competitive
Rules engines remain the correct tool for IAM, network policies, and deterministic gates — and they often run alongside Steward in the same architecture. The distinction is what each is built to answer. Use rules engines for "is this permitted." Use Steward for "is this admissible."
→ Blueprint templates

Reference templates for common
agentic decision classes.

For decision classes that recur across enterprises, Blueprint templates exist as starting points — institutional patterns that codify the conditions, dimensions, and trust dynamics that apply broadly, calibrated to your specific authority structures during deployment. Templates are not turnkey policies. They are reference patterns that compress weeks of working-session design into days of calibration.

→ Template set 01 · AI-augmented engineering · 3 templates · available
ai-code-change → Code modifications · pull requests · branch operations

AI Code Change Safety

Governs AI agents that propose, generate, or apply code changes to production systems. Two categorical prohibitions: agents cannot commit directly to protected branches, and cannot introduce secret material in pull requests. Above that floor, every pull request is scored across five dimensions — the soundness of the agent's rationale, adherence to branch and review policies, citation of relevant repository context, grounding in architecture documentation, and absence of security or privacy regression. Scores aggregate into a graduated signal — ok, nudge, escalate, or block — calibrated to your team's risk tolerance.

The agent's trust trajectory accumulates over time, with interventions adding measured debt that decays with sustained correct behavior. When debt crosses thresholds, the agent moves into elevated monitoring or restricted mode until institutional review.
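The floor-then-score pattern described for this template could look roughly like the sketch below. The weights, cut-offs, and dictionary keys are assumptions made for illustration; the template's real calibration is set during deployment.

```python
# Assumed weights for the five dimensions; a real deployment would calibrate these.
WEIGHTS = {
    "rationale_soundness": 0.30,
    "policy_adherence": 0.25,
    "repository_context": 0.15,
    "architecture_grounding": 0.15,
    "no_regression": 0.15,
}
# Assumed cut-offs; any aggregate below the last one maps to "block".
CUTOFFS = [(0.80, "ok"), (0.60, "nudge"), (0.40, "escalate")]

def signal_for(pull_request, scores):
    """Apply the two floor prohibitions first, then aggregate dimension scores."""
    if pull_request["targets_protected_branch"] or pull_request["contains_secret"]:
        return "block"  # categorical prohibition: no score can override it
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    for cutoff, signal in CUTOFFS:
        if total >= cutoff:
            return signal
    return "block"
```

Checking the floor before scoring keeps the prohibitions absolute: a pull request with secret material is blocked even if every dimension scores perfectly.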

ai-deployment → Service deployment · rollback · release promotion

AI Deployment Policy

Applies when AI agents touch the release pipeline directly: deploying services, rolling back, promoting releases between environments. Floor conditions are absolute — no deployment during an active incident, no deployment without a rollback plan, no deployment without passing CI. Above the floor, the template scores the agent's operational grounding (citing the change ticket, the service dashboard, the runbook), its tool safety posture (blast radius, rollback availability), its context awareness (freeze windows, owner availability), and its impact awareness (privacy or user-data risk).

Thresholds are stricter than in the code-change template: a borderline signal becomes nudge-worthy faster, because the operational consequences of a poor deployment exceed those of a poor commit. Trust debt accumulates correspondingly faster.

incident-remediation → Service restart · scaling · hotfix · config change during incidents

AI Incident Remediation

The strictest of the three engineering templates. Applies when AI agents have authority during active production incidents — restarting services, scaling capacity, applying hotfixes, or changing configuration in response to operational distress. Floor conditions reflect elevated stakes: no destructive configuration changes without explicit incident commander approval, no hotfix deployment without a present approver, no multi-region remediation without human confirmation, no remediation action without a cited runbook step.

Above the floor, the agent must demonstrate grounded reasoning (citing logs, metrics, traces, and runbook references) and verified operational safeguards before it acts. Thresholds are the strictest in the engineering set. Trust debt accumulates most aggressively here, with a single block event carrying double the debt of an equivalent event in a less risky decision class.

→ Template sets on the build path
02 · Regulated adjudication · Claims decisioning, indemnity drafting, sanctions screening
03 · Healthcare triage · Clinical decision support, multi-agent coordination
04 · Industrial coordination · Predictive maintenance commitments, dispatch, capacity allocation
05 · Regulated trade · Export classification, KYC decisions, transaction screening
→ For novel decision classes
When a decision class is genuinely bespoke — proprietary adjudication logic, novel coordination patterns, undocumented regulatory expectations — the engagement begins with a blank schema and your institutional documentation. The working session produces a Blueprint shaped to your specific authority structures, owned by your enterprise, versioned in your control. Templates accelerate common cases. Bespoke Blueprints address the rest.
→ Coverage

Three coverage modes.
The enterprise chooses what to verify, and how often.

Coverage is the strategic axis above deployment shape. Deployment shape decides what Steward does when it observes a decision. Coverage decides which decisions it observes at all. Both axes are configured per decision class.

→ Coverage 01 · Always-on

Every decision verified

Used for high-stakes decision classes, regulated workflows, and bordereau-reportable cases. Steward verifies every decision the agent proposes. The default for any class with material discovery or treaty exposure.
→ Coverage 02 · Sampled

Statistical verification

For high-volume routine classes where per-decision verification is operationally heavy. Steward verifies a statistically meaningful sample · the ledger documents the sampling protocol. Compliance by design — defensible to regulators and auditors.
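A sampling protocol is easier to defend when it is deterministic and reproducible. One way to get that, sketched here as an assumption and not as Steward's method, is to hash each decision identifier with a recorded salt and select the decisions whose hash falls below the sampling rate. The `in_sample` function and the salt value are hypothetical.

```python
import hashlib

def in_sample(decision_id, sample_rate, salt="sampling-protocol-v1"):
    """Deterministically select roughly sample_rate of decisions.

    The same id, rate, and salt always give the same answer, so an
    auditor can re-derive which decisions were selected.
    """
    digest = hashlib.sha256(f"{salt}:{decision_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)

selected = [d for d in (f"decision-{i}" for i in range(10_000))
            if in_sample(d, 0.10)]
```

Recording the salt and rate alongside the ledger entries is what would let a third party confirm that the sample was not chosen after the fact.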
→ Coverage 03 · Baseline-only

Periodic compliance pass

For classes still being model-validated. Steward verifies on cadence rather than per-decision — used during model transition or for lines awaiting full Blueprint authorisation. An entry mode, not a destination.
→ Orthogonality
Deployment shape and coverage mode are independent choices. A single enterprise can run Always-on Observe on its highest-stakes adjudication class and Sampled Enforce on its routine ones — and Direct mode on self-contained low-risk decisions. The matrix is configured per decision class. Risk and operations grade both axes to criticality.
→ Configuration evolves
Coverage and deployment are not set once. Steward accumulates evidence over time and proposes refinements · which decision classes should move from Sampled to Always-on, which thresholds need adjustment, which escalation triggers are firing too often or too rarely. Steward proposes; the enterprise authorises. The configuration becomes a living artifact, versioned alongside the Blueprint.
→ Use case · Comparative agent evaluation

Before you adopt an agent,
measure it against your own conditions.

Enterprises evaluating AI agents today depend on benchmarks published by the vendors of those agents — and on demos calibrated by the same parties. The institutional question is different: not "which agent is best in the abstract", but "which agent operates admissibly under our specific conditions, on our actual decisions, against our governance envelope." That question has no neutral answer today. Steward provides one.

→ 01
Declare the conditions
Author or adapt the Blueprint that codifies your institutional admissibility — for the decision class you intend to delegate. Same artifact you would use in production.
→ 02
Run the candidates in parallel
Deploy each candidate agent against the same decisions under Steward observation. Claude, GPT, Gemini, open-source models, internally-tuned systems — every candidate operates under identical verification conditions.
→ 03
Read the divergence
The ledger records every verification per agent. Which candidates ground reasoning in admissible sources, which respect institutional envelopes, which accumulate trust debt fastest, which surface graduated signals consistently. Evidence the enterprise owns, produced on its own decisions.
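Reading the divergence amounts to grouping ledger entries by candidate and comparing the signals each one drew on the same decisions. The sketch below assumes a hypothetical entry shape with `agent`, `decision_id`, and `signal` keys; the candidate names and signals are made-up sample data.

```python
from collections import Counter, defaultdict

def summarise(entries):
    """Count signals per candidate agent from a list of ledger-like dicts."""
    per_agent = defaultdict(Counter)
    for entry in entries:
        per_agent[entry["agent"]][entry["signal"]] += 1
    return {agent: dict(counts) for agent, counts in per_agent.items()}

# Illustrative entries: two candidates run against the same two decisions.
entries = [
    {"agent": "candidate_a", "decision_id": "d-1", "signal": "ok"},
    {"agent": "candidate_b", "decision_id": "d-1", "signal": "escalate"},
    {"agent": "candidate_a", "decision_id": "d-2", "signal": "flag"},
    {"agent": "candidate_b", "decision_id": "d-2", "signal": "flag"},
]
summary = summarise(entries)
```

Because every candidate is verified against the same Blueprint on the same decisions, differences in these counts reflect the agents and not the test conditions.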
→ Why this matters institutionally
Anthropic cannot honestly evaluate whether Claude meets your conditions better than GPT does — its incentive runs in one direction. The same is true for every model vendor. MeaningStack does not sell models. The neutral evaluator is a structural position, not a marketing claim. When the evidence is produced by infrastructure your enterprise owns and operates, the decision belongs to you.
→ What the enterprise gets at the end
Per-agent verification record
Tamper-evident ledger entries for every decision each candidate made, against the same Blueprint.
Dimensional comparison
Reasoning quality, knowledge grounding, context fit, tool safety, ethical alignment — scored per agent across identical conditions.
Trust trajectory per candidate
How each agent accumulates and decays trust debt under realistic operational load.
A defensible adoption decision
Evidence the enterprise can present to its risk committee, regulator, or board — produced by infrastructure it controls.

This use case is not separate from operational verification — it is the same infrastructure deployed for a different question. The Blueprint authored for evaluation becomes the Blueprint that governs production. The ledger that compares candidates becomes the ledger that monitors the selected one. Evaluation is the entry point. Operational verification is the continuation.

→ How Steward matures

Verification today. Prediction next.
Prevention on the build path.

Steward is not a static surface. The same ledger that records every verification today produces the data substrate that lets the system predict tomorrow. Recording matures into prediction. Prediction matures into prevention. At enterprise scale, this is not optional — it is the only operationally viable path.

→ Today · operational

Record & verify

Every decision is verified at runtime against the Blueprint. Every verification produces a tamper-evident ledger entry — Blueprint version, inputs cited, signal emitted, escalation path taken.

This is what Steward does in production with the first design partner. It is the entry point and the foundation.

→ Status: In production · SDK shipping
→ Horizon · the full arc

Prevent

Steward stops decisions before they commit — when prediction confidence is high enough and the enterprise policy authorises it. The verification persists past the decision; the commitment is held until conditions verify.

Governed Escrow extends this to value transfer: authority is granted, but never irrevocably. Verification becomes infrastructure that scales as autonomous systems scale.

→ Status: Patent pending · architecture specified
→ Why this matters at scale
Per-decision verification is operationally heavy when an enterprise runs millions of agentic decisions monthly. Pattern detection is what makes verification economically sustainable at that scale — Steward learns which classes need full coverage, which can be sampled, which are approaching drift, which are stable. As autonomous systems scale, prediction is not a feature — it is the architecture that survives the scale.
→ Integration · how Steward connects

Three integration surfaces.
Your core systems untouched.

Steward integrates at three points: the agent's rationale boundary, the enterprise's data perimeter, and the source registries your governance functions already trust. No changes to your agent beyond one SDK call per decision. No instrumentation of your business systems. No data exfiltration. Steward sits adjacent.

→ Surface 01 · SDK

At the agent boundary

A lightweight SDK your agent calls before committing a decision. Passes rationale, inputs cited, decision class. Language-agnostic at the protocol level.

→ Touches: Your agent code only · one call per decision
→ Surface 02 · APIs

At the enterprise boundary

APIs the enterprise's governance function calls to author Blueprints, review ledger entries, configure deployment. Sits inside your perimeter or in EU-resident environments.

→ Touches: Your governance tooling · your audit pipeline
→ Surface 03 · Registries

Queries to your sources of truth

Steward queries the registries you already run — policy admin, sanctions providers, approved-vendor lists, authority schedules. At decision time. Your registries stay yours.

→ Touches: Read-only on systems you already query internally
→ Verification latency
Sub-second
Returns in milliseconds · fits within the agent's latency budget
→ Source code changes
Zero
No modification to claims platforms, code repositories, or data warehouses
→ Data egress
None
Rationale verified in-perimeter · ledger stays in your environment
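→ SDK · what an integration call looks like · Representative pseudo-code · full SDK documentation in onboarding
# agent integration · pseudo-code
# agent, agent_input, ledger, and human_review are placeholders
# supplied by your own integration.

from meaningstack import steward

decision = agent.reason(agent_input)

# One verification call per decision, before anything is committed.
verification = steward.verify(
    decision_class="export_classification",
    rationale=decision.rationale,
    inputs_cited=decision.evidence,
    blueprint_version="v3.2",
)

# Record every verification, whatever the signal, so blocked and
# escalated decisions also leave a ledger entry.
ledger.write(verification)

if verification.signal == "escalate":
    human_review(decision, verification)
elif verification.signal == "block":
    decision.cancel()
else:
    decision.commit()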
→ Adoption promise
We do not modify your claims platform, your policy admin system, your code repositories, or your data warehouses. Steward observes the agent's rationale at decision time and applies verification per Blueprint. The integration footprint is the SDK at the agent boundary, the APIs at your governance perimeter, and read-only queries to registries you already operate. If that does not work for your architecture, we do not start.
→ Who Steward is for

Enterprises with AI in their decision fabric
and risk in their balance sheet.

Steward is built for enterprises where agentic AI is already operating inside regulated, contractual, or institutional workflows — and where the absence of runtime verification is becoming a discovery risk, a regulator question, or a board-level exposure.

→ Enterprise risk

Chief Risk Officers · Enterprise Risk Leads

You are responsible for the operational risk of AI in production. You need evidence — not promises — that agentic decisions stay within authority. Steward provides the runtime evidentiary surface your audit committee will eventually require.

→ Internal audit

Internal Audit · Compliance Leadership

You need to attest that AI-driven decisions are governed equivalently to human-driven ones. Steward produces the same evidentiary trail an external regulator or class-action discovery would demand — without rebuilding your audit infrastructure.

→ Regulated industries

Insurance · Financial services · Regulated industrials

Your delegated authority structures were designed for humans. Agentic AI is operating inside the same envelopes without the same evidentiary discipline. Steward closes that gap on your terms — before NAIC, EIOPA, or your reinsurers ask harder questions.

→ Engineering leadership

Heads of AI · ML platform leads

You are deploying agents that adjudicate, classify, or commit. You need verification that is composable with how your agents already work — not another instrumentation layer to maintain. Steward integrates at the rationale boundary, not the source code.

→ Frequently asked

What enterprise risk and AI leadership
ask before they engage.

How is Steward different from AI guardrails or content moderation?

Guardrails check the agent's output against safety rules — usually content rules. Steward verifies the agent's rationale against the enterprise's specification of admissible participation. Different unit of analysis, different question. Guardrails answer "is this output unsafe?" Steward answers "was this decision authorised, evidence-admissible, and within the delegated envelope?"

What is a Blueprint, and who writes it?

A Blueprint is the enterprise's declaration of what an agent is authorised to do for a given decision class — authority envelope, admissible evidence, escalation triggers, ledger requirements. The enterprise writes and versions it. MeaningStack provides the schema and supports the first authoring; the artifact remains yours, in your version control. Steward consumes the Blueprint; your agents do not.

Does Steward replace observability tools like Datadog?

No. Observability records what the agent did — token usage, latency, tool calls, outputs. Steward verifies whether what the agent did was supposed to happen. The two are complementary. Datadog runs alongside Steward. They answer different questions; both are needed.

Does Steward modify our claims platform, policy admin, or code repositories?

No. Steward sits adjacent to your core systems, not in line. Integration happens at the agent's rationale boundary via SDK. Source-of-truth systems (policy admin, sanctions providers, authority schedules) are queried at decision time but remain yours, in your perimeter. No core system modification is required.

Can we run Steward in Observe mode first, then graduate to Enforce?

Yes, and most enterprises do. Observe deployment lets your risk function build confidence in the specifications and verification signals before any intervention happens. In Observe, Steward writes the ledger entry and triggers no intervention. When governance is comfortable, you graduate specific decision classes to Enforce. The architecture is designed for this graduation path.

How does Steward relate to KamiraFlow and the rest of MeaningStack?

KamiraFlow is the measurement layer: observability-grade signals for AI participation in engineering systems. Steward is the governance layer: runtime verification of agent rationale against enterprise-owned specifications. Both rest on Groundability, MeaningStack's patent-pending foundation on the legibility of environment to agentic systems.

What does an engagement look like?

Each engagement begins with one decision class and a working session. You send a class — auto FPD adjudication, export classification, indemnity drafting, whatever matters to you. MeaningStack returns a specification draft your team can review, contest, version, and own. From there, deployment in Observe mode is the typical first step. Not a pilot. Not a procurement. A working session on real data.

→ How to engage

One decision class.
One working session.
A specification you can govern against.

Steward engagements begin with a working session on real data — not a pitch, not a procurement. You send one decision class and the documents that govern it today. We return a specification draft your team can review, contest, version, and own. The conversation continues from the artifact.

→ 01 · YOU SEND

One decision class

An adjudication, classification, or commitment class your AI agents are already making — plus the documents (policies, runbooks, authority schedules) that govern it today.

→ 02 · WE RETURN

A specification draft

Schema, authority envelope, admissible evidence, escalation triggers, ledger spec. An artifact your team can review, contest, version, and own.

→ 03 · YOU DECIDE

Next step is yours

First Steward deployment in Observe mode on this class, expansion to other classes, or no further action. The specification stays yours either way.