Enterprise AI is no longer a research project. Agents adjudicate claims, classify shipments, route work, draft policy responses, write code. The decisions are real. The institutional accountability around them has not caught up.
The existing answer — observability — records what happened. It does not verify whether what happened was supposed to happen. It does not check the agent's reasoning against the authority envelope your enterprise has already delegated to humans. It does not confirm the inputs cited were admissible. It does not detect when an agent crossed a boundary nobody noticed yet.
Steward fills that gap. It observes the agent's rationale at decision time and verifies it against the specifications and registries your enterprise already owns — your policy systems, approved-source lists, authority schedules, regulatory expectations. Verification signals are graduated. Intervention is governed by enterprise policy. Your core systems remain untouched.
Steward is the runtime layer, but it operates inside a three-component pattern. Blueprints define bounded operational participation for delegated systems. Steward verifies runtime adherence against them. The ledger records every verification. Each component is owned by the enterprise. Each is exportable. Each integrates with the systems you already run.
Steward is not a single mode of operation. Different decision classes need different verification postures — and the same enterprise will run several at once. The architecture lets risk and operations teams grade verification to the criticality of each decision.
Intervention is graduated, not binary. An agent rationale that is borderline gets a flag in the ledger. One that crosses authority gets blocked. One that requires human judgment gets escalated. Steward proposes the signal; the enterprise authorises the response.
Rules engines like Open Policy Agent, HashiCorp Sentinel, and AWS Cedar evaluate boolean conditions on structured data — and they do that well. Steward is built for a different question. Where rules engines ask "is this action permitted?", Steward asks "is this agentic participation operationally admissible under the conditions the enterprise declared?" The difference is not implementation. It is the unit of analysis.
For decision classes that recur across enterprises, Blueprint templates exist as starting points — institutional patterns that codify the conditions, dimensions, and trust dynamics that apply broadly, calibrated to your specific authority structures during deployment. Templates are not turnkey policies. They are reference patterns that compress weeks of working-session design into days of calibration.
The code-change template governs AI agents that propose, generate, or apply code changes to production systems. It sets two categorical prohibitions: agents cannot commit directly to protected branches, and they cannot introduce secret material in pull requests. Above that floor, every pull request is scored across five dimensions: the soundness of the agent's rationale, adherence to branch and review policies, citation of relevant repository context, grounding in architecture documentation, and absence of security or privacy regression. Scores aggregate into a graduated signal (ok, nudge, escalate, or block) calibrated to your team's risk tolerance.
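As an illustration of the floor-then-score pattern described above, here is a minimal Python sketch. The dimension keys, the unweighted mean, and the numeric thresholds are assumptions made for the example; the actual values would come from the Blueprint calibrated for your team.

```python
from dataclasses import dataclass

# Hypothetical dimension keys and thresholds, for illustration only.
DIMENSIONS = ("rationale", "policy", "repo_context", "architecture", "security")
THRESHOLDS = (("ok", 0.85), ("nudge", 0.70), ("escalate", 0.50))  # below 0.50 -> block


@dataclass(frozen=True)
class PullRequest:
    target_branch_protected: bool
    direct_commit: bool
    contains_secret: bool
    scores: dict  # dimension key -> score in [0, 1]


def signal_for(pr: PullRequest) -> str:
    # Floor: the two categorical prohibitions block regardless of scores.
    if pr.contains_secret or (pr.direct_commit and pr.target_branch_protected):
        return "block"
    # Above the floor: aggregate the five dimension scores.
    # An unweighted mean is an assumption; a Blueprint could weight dimensions.
    mean = sum(pr.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    for name, minimum in THRESHOLDS:
        if mean >= minimum:
            return name
    return "block"
```

A pull request with strong scores but a leaked credential still yields `block`, because the floor is checked before any score is read.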
The agent's trust trajectory accumulates over time, with interventions adding measured debt that decays with sustained correct behavior. When debt crosses thresholds, the agent moves into elevated monitoring or restricted mode until institutional review.
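The trust trajectory can be pictured as a running debt balance. The sketch below is one possible model, not Steward's implementation: the debt per signal, the half-life, and the two thresholds are hypothetical numbers, and exponential decay is an assumed decay shape.

```python
from dataclasses import dataclass

# Hypothetical values, for illustration only.
DEBT_PER_SIGNAL = {"nudge": 1.0, "escalate": 3.0, "block": 5.0}
HALF_LIFE_DECISIONS = 50  # debt halves after this many consecutive "ok" decisions
DECAY = 0.5 ** (1 / HALF_LIFE_DECISIONS)
ELEVATED_AT = 5.0
RESTRICTED_AT = 10.0
MODES = ("normal", "elevated_monitoring", "restricted")


@dataclass
class TrustTrajectory:
    debt: float = 0.0
    mode: str = "normal"

    def _mode_for_debt(self) -> str:
        if self.debt >= RESTRICTED_AT:
            return "restricted"
        if self.debt >= ELEVATED_AT:
            return "elevated_monitoring"
        return "normal"

    def record(self, signal: str) -> None:
        if signal == "ok":
            self.debt *= DECAY  # sustained correct behavior decays debt
        else:
            self.debt += DEBT_PER_SIGNAL[signal]  # interventions add debt
        # The mode only ratchets upward here; it is released by review.
        self.mode = max(self.mode, self._mode_for_debt(), key=MODES.index)

    def institutional_review(self) -> None:
        # After review, the mode is reset to whatever the current debt implies.
        self.mode = self._mode_for_debt()
```

In this model the debt itself decays with good behavior, but a raised mode stays in force until `institutional_review()` is called, which mirrors the "until institutional review" condition.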
The deployment template applies when AI agents touch the release pipeline directly: deploying services, rolling back, promoting releases between environments. Floor conditions are absolute: no deployment during an active incident, no deployment without a rollback plan, no deployment without passing CI. Above the floor, the template scores the agent's operational grounding (citing the change ticket, the service dashboard, the runbook), its tool safety posture (blast radius, rollback availability), its context awareness (freeze windows, owner availability), and its impact awareness (privacy or user-data risk).
Thresholds are stricter than code-change — a borderline signal becomes nudge-worthy faster, because the operational consequences of a poor deployment exceed those of a poor commit. Trust accumulation is correspondingly steeper.
This is the strictest of the three engineering templates. It applies when AI agents have authority during active production incidents: restarting services, scaling capacity, applying hotfixes, or changing configuration in response to operational distress. Floor conditions reflect elevated stakes: no destructive configuration changes without explicit incident commander approval, no hotfix deployment without a present approver, no multi-region remediation without human confirmation, no remediation action without a cited runbook step.
Above the floor, the agent must demonstrate grounded reasoning — citing logs, metrics, traces, and runbook references — and verified operational safeguards before action. Thresholds are the strictest in the engineering set. Trust accumulation is the most aggressive, with single block events carrying double the debt of equivalent events in less risky decision classes.
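The way the three engineering templates tighten from one to the next can be pictured as per-class profiles. In the sketch below, the class keys and every number are hypothetical; only the ordering (stricter thresholds and heavier debt as operational risk rises) follows the text, and the doubling is shown relative to the code-change class as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassProfile:
    nudge_below: float     # aggregate score below this is at least a nudge
    escalate_below: float  # below this is at least an escalation
    block_below: float     # below this is a block
    block_debt: float      # trust debt added by a single block event


# Hypothetical numbers; only their ordering reflects the templates.
PROFILES = {
    "code_change": ClassProfile(0.85, 0.70, 0.50, 5.0),
    "deployment": ClassProfile(0.90, 0.75, 0.55, 7.5),
    "incident_remediation": ClassProfile(0.95, 0.80, 0.60, 10.0),
}


def classify(decision_class: str, score: float) -> str:
    p = PROFILES[decision_class]
    if score < p.block_below:
        return "block"
    if score < p.escalate_below:
        return "escalate"
    if score < p.nudge_below:
        return "nudge"
    return "ok"
```

With these numbers, an aggregate score of 0.87 passes as `ok` for a code change but becomes a `nudge` for a deployment, and 0.78 is a `nudge` for a deployment but an `escalate` during incident remediation.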
Coverage is the strategic axis above deployment shape. Deployment shape decides what Steward does when it observes a decision. Coverage decides which decisions it observes at all. Both axes are configured per decision class.
Enterprises evaluating AI agents today depend on benchmarks published by the vendors of those agents — and on demos calibrated by the same parties. The institutional question is different: not "which agent is best in the abstract", but "which agent operates admissibly under our specific conditions, on our actual decisions, against our governance envelope." That question has no neutral answer today. Steward provides one.
This use case is not separate from operational verification — it is the same infrastructure deployed for a different question. The Blueprint authored for evaluation becomes the Blueprint that governs production. The ledger that compares candidates becomes the ledger that monitors the selected one. Evaluation is the entry point. Operational verification is the continuation.
Steward is not a static surface. The same ledger that records every verification today produces the data substrate that lets the system predict tomorrow. Recording matures into prediction. Prediction matures into prevention. At enterprise scale, this is not optional — it is the only operationally viable path.
Every decision is verified at runtime against the Blueprint. Every verification produces a tamper-evident ledger entry — Blueprint version, inputs cited, signal emitted, escalation path taken.
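The page does not say how tamper evidence is achieved. One common technique is a hash chain, where each entry's hash covers the previous entry's hash, so any later edit breaks verification. The sketch below assumes that technique; the `Ledger` class and its field layout are hypothetical, and only the four recorded fields come from the text.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


@dataclass
class Ledger:
    entries: list = field(default_factory=list)

    def append(self, blueprint_version, inputs_cited, signal, escalation_path):
        payload = {
            "blueprint_version": blueprint_version,
            "inputs_cited": list(inputs_cited),
            "signal": signal,
            "escalation_path": escalation_path,
        }
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append(
            {"payload": payload, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, payload)}
        )

    def verify(self) -> bool:
        # Recompute every hash from the start; any edit breaks the chain.
        prev = GENESIS
        for entry in self.entries:
            if entry["prev_hash"] != prev:
                return False
            if entry["hash"] != _digest(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True
```

Changing the `signal` of an old entry after the fact makes `verify()` return `False`, which is the property an auditor would rely on.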
This is what Steward does in production with the first design partner. It is the entry point and the foundation.
With sufficient ledger evidence, Steward identifies patterns of approaching violation — decision classes where rationale is drifting, environments where groundability is degrading, escalation triggers that are firing more frequently than the Blueprint authorised.
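One of the patterns named above, escalation triggers firing more often than the Blueprint authorised, can be approximated with a rolling-window rate check. This is a deliberately simple sketch, not the research method referenced on this page; the class name, window size, and minimum sample count are assumptions.

```python
from collections import deque


class EscalationDriftMonitor:
    """Flags a decision class whose recent escalation rate exceeds the authorised rate."""

    def __init__(self, authorised_rate: float, window: int = 100, min_samples: int = 20):
        self.authorised_rate = authorised_rate
        self.min_samples = min_samples
        self._recent = deque(maxlen=window)  # oldest observations fall off

    def observe(self, signal: str) -> None:
        self._recent.append(signal == "escalate")

    @property
    def rate(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def drifting(self) -> bool:
        # Require a minimum number of samples to avoid flagging on noise.
        if len(self._recent) < self.min_samples:
            return False
        return self.rate > self.authorised_rate
```

A monitor like this would read signals from the ledger per decision class and raise a proposal for review, leaving the decision to act with the enterprise.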
Steward proposes intervention before the violation; the enterprise authorises pre-emptive action. Pattern detection has a research foundation (SSRN 2025, 68 medical triage simulations); productisation is the build path.
Steward stops decisions before they commit — when prediction confidence is high enough and the enterprise policy authorises it. The verification persists past the decision; the commitment is held until conditions verify.
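Holding a commitment until conditions verify can be sketched as a small state machine. The `HeldCommitment` class below is hypothetical and only illustrates the idea: the commit callback runs once, and only after every named condition checks out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HeldCommitment:
    commit: Callable[[], None]  # the action to run once conditions verify
    conditions: dict            # name -> zero-argument callable returning bool
    state: str = "held"

    def pending(self) -> list:
        # Names of conditions that do not yet verify.
        return [name for name, check in self.conditions.items() if not check()]

    def try_release(self) -> bool:
        if self.state != "held":
            return self.state == "committed"
        if self.pending():
            return False  # still held; nothing has been committed
        self.commit()
        self.state = "committed"
        return True

    def cancel(self) -> None:
        if self.state == "held":
            self.state = "cancelled"
```

Calling `try_release()` repeatedly is safe in this sketch: once committed, the callback is not run again.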
Governed Escrow extends this to value transfer: authority is granted, but never irrevocably. Verification becomes infrastructure that scales as autonomous systems scale.
Steward integrates at three points — the agent's rationale boundary, the enterprise's data perimeter, and the source registries your governance functions already trust. No source code modification of your agent. No instrumentation of your business systems. No data exfiltration. Steward sits adjacent.
A lightweight SDK your agent calls before committing a decision. Passes rationale, inputs cited, decision class. Language-agnostic at the protocol level.
APIs the enterprise's governance function calls to author Blueprints, review ledger entries, configure deployment. Sits inside your perimeter or in EU-resident environments.
Steward queries the registries you already run — policy admin, sanctions providers, approved-vendor lists, authority schedules. At decision time. Your registries stay yours.
    # agent integration · pseudo-code
    # `agent`, `human_review`, and `ledger` stand in for the caller's own objects.
    from meaningstack import steward

    decision = agent.reason(input)

    # Verify the rationale before the decision commits.
    verification = steward.verify(
        decision_class="export_classification",
        rationale=decision.rationale,
        inputs_cited=decision.evidence,
        blueprint_version="v3.2",
    )

    if verification.signal == "escalate":
        human_review(decision, verification)
    elif verification.signal == "block":
        decision.cancel()
    else:
        decision.commit()

    # Every verification is recorded, whatever the signal.
    ledger.write(verification)
Steward is built for enterprises where agentic AI is already operating inside regulated, contractual, or institutional workflows — and where the absence of runtime verification is becoming a discovery risk, a regulator question, or a board-level exposure.
You are responsible for the operational risk of AI in production. You need evidence — not promises — that agentic decisions stay within authority. Steward provides the runtime evidentiary surface your audit committee will eventually require.
You need to attest that AI-driven decisions are governed equivalently to human-driven ones. Steward produces the same evidentiary trail an external regulator or class-action discovery would demand — without rebuilding your audit infrastructure.
Your delegated authority structures were designed for humans. Agentic AI is operating inside the same envelopes without the same evidentiary discipline. Steward closes that gap on your terms — before NAIC, EIOPA, or your reinsurers ask harder questions.
You are deploying agents that adjudicate, classify, or commit. You need verification that is composable with how your agents already work — not another instrumentation layer to maintain. Steward integrates at the rationale boundary, not the source code.
Guardrails check the agent's output against safety rules — usually content rules. Steward verifies the agent's rationale against the enterprise's specification of admissible participation. Different unit of analysis, different question. Guardrails answer "is this output unsafe?" Steward answers "was this decision authorised, evidence-admissible, and within the delegated envelope?"
A Blueprint is the enterprise's declaration of what an agent is authorised to do for a given decision class — authority envelope, admissible evidence, escalation triggers, ledger requirements. The enterprise writes and versions it. MeaningStack provides the schema and supports the first authoring; the artifact remains yours, in your version control. Steward consumes the Blueprint; your agents do not.
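To make the shape of a Blueprint concrete, here is a hypothetical Python representation. It is not the MeaningStack schema: the field names simply follow the four parts listed above, and the `inadmissible` helper is an invented illustration of how cited inputs might be checked against admissible evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blueprint:
    decision_class: str
    version: str
    authority_envelope: dict        # limits the agent may act within
    admissible_evidence: frozenset  # source identifiers the agent may cite
    escalation_triggers: tuple      # conditions that require human judgment
    ledger_requirements: tuple      # fields every ledger entry must carry

    def inadmissible(self, inputs_cited) -> list:
        # Cited inputs that are not on the admissible-evidence list.
        return sorted(set(inputs_cited) - self.admissible_evidence)


# Example values are invented for illustration.
export_bp = Blueprint(
    decision_class="export_classification",
    version="v3.2",
    authority_envelope={"max_shipment_value": 50_000},
    admissible_evidence=frozenset({"tariff_schedule", "sanctions_list"}),
    escalation_triggers=("dual_use_goods",),
    ledger_requirements=("blueprint_version", "inputs_cited", "signal"),
)
```

Because the object is frozen and carries its own version, a change to the envelope becomes a new Blueprint version rather than an in-place edit, which fits keeping the artifact in version control.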
No. Observability records what the agent did — token usage, latency, tool calls, outputs. Steward verifies whether what the agent did was supposed to happen. The two are complementary. Datadog runs alongside Steward. They answer different questions; both are needed.
No. Steward sits adjacent to your core systems, not in line. Integration happens at the agent's rationale boundary via SDK. Source-of-truth systems (policy admin, sanctions providers, authority schedules) are queried at decision time but remain yours, in your perimeter. No core system modification is required.
Yes — and most enterprises do. Observe deployment lets your risk function build confidence in the specifications and verification signals before any intervention happens. Steward proposes nothing in Observe except the ledger entry. When governance is comfortable, you graduate specific decision classes to Enforce. The architecture is designed for this graduation path.
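The difference between Observe and Enforce can be sketched as a per-class switch. The configuration keys, the default for unlisted classes, and the response names below are assumptions for the example; the mapping from signal to response mirrors the pseudo-code elsewhere on this page.

```python
# Hypothetical per-decision-class deployment modes.
MODE_BY_CLASS = {
    "export_classification": "observe",
    "code_change": "enforce",
}

ENFORCE_RESPONSE = {
    "ok": "commit",
    "nudge": "commit",        # borderline: flagged in the ledger only
    "escalate": "human_review",
    "block": "cancel",
}


def response_for(decision_class: str, signal: str, ledger: list) -> str:
    # Unlisted classes default to observe (an assumption for this sketch).
    mode = MODE_BY_CLASS.get(decision_class, "observe")
    # The ledger entry is written in both modes.
    ledger.append({"decision_class": decision_class, "signal": signal, "mode": mode})
    if mode == "observe":
        return "commit"  # record only; no intervention
    return ENFORCE_RESPONSE[signal]
```

Graduating a decision class is then a one-line configuration change from `"observe"` to `"enforce"`, with the ledger history from the Observe period available to justify it.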
KamiraFlow is the measurement layer: observability-grade signals for AI participation in engineering systems. Steward is the governance layer: runtime verification of agent rationale against enterprise-owned specifications. Both rest on Groundability, MeaningStack's patent-pending foundation on the legibility of environment to agentic systems.
Each engagement begins with one decision class and a working session. You send a class — auto FPD adjudication, export classification, indemnity drafting, whatever matters to you. MeaningStack returns a specification draft your team can review, contest, version, and own. From there, deployment in Observe mode is the typical first step. Not a pilot. Not a procurement. A working session on real data.
Steward engagements begin with a working session on real data — not a pitch, not a procurement. You send one decision class and the documents that govern it today. We return a specification draft your team can review, contest, version, and own. The conversation continues from the artifact.
An adjudication, classification, or commitment class your AI agents are already making — plus the documents (policies, runbooks, authority schedules) that govern it today.
Schema, authority envelope, admissible evidence, escalation triggers, ledger spec. An artifact your team can review, contest, version, and own.
First Steward deployment in Observe mode on this class, expansion to other classes, or no further action. The specification stays yours either way.