Decision Coverage: Measuring Deployability of Agentic AI Systems Through Enforcement-Grounded Evidence

Johnny Wiley

ConstantX

Preprint, February 2026

Keywords: agentic AI, evaluation, deployability, enforcement, coverage, safety

Abstract. Existing evaluation frameworks for AI systems focus on capability — whether a model can perform a task correctly. For agentic AI systems that act autonomously, capability is necessary but insufficient: the critical question is whether the system behaves within defined boundaries when it fails. We introduce Decision Coverage, a metric and evidence system that classifies every autonomous run into exactly one of three verdicts: valid_commit (task completed within constraints), bounded_failure (task failed safely within defined boundaries), or undefined_behavior (system behavior fell outside the defined envelope). We present ConstantX, a system that produces immutable, hashable evidence chains from engine-level enforcement traces through verdict reduction to compliance-grade reports. We formalize a coverage methodology grounded in architecture-derived failure categories, their enforcement surfaces, and associated threat vectors, and demonstrate the system on an 84-scenario suite evaluated across two independent runs against claude-opus-4-6 (n=168). Terminal Coverage (valid_commit + bounded_failure) on this configuration was 99.40% [95% CI: 96.71–99.89], with a single undefined behavior event attributable to stochastic model output formatting rather than an enforcement gap — a distinction that Decision Coverage makes structurally visible.

1. Introduction

The deployment of agentic AI systems — language models that take autonomous actions such as reading files, writing code, executing commands, and interacting with external services — introduces a class of risk that capability benchmarks do not address. A model that scores 95% on HumanEval [1] may still be undeployable if it ignores tool boundaries, fails to terminate, escalates side effects beyond its authorization scope, or behaves unpredictably under adversarial input.

The gap is structural. Capability benchmarks answer: "Can this model do the task?" Deployability requires answering: "When this model fails, does it fail safely?" These are fundamentally different questions, and conflating them leads to deployment decisions grounded in the wrong evidence.

We observe that for autonomous systems, the failure mode matters more than the success rate. A coding agent that completes 80% of tasks but silently modifies files outside its workspace on 2% of runs is more dangerous than one that completes 40% of tasks but terminates cleanly on all others. No existing metric captures this distinction.

This paper introduces Decision Coverage, a framework that:

Classifies every autonomous run into one of three mutually exclusive verdicts.
Defines a coverage metric over architecture-derived failure categories and their associated threat vectors.
Produces an immutable evidence chain from engine traces to compliance-grade reports.
Separates capability failures from control failures structurally, not by human judgment.

We implement this framework in ConstantX, a system comprising a controlled execution engine, a scenario suite, a deterministic verdict reducer, and a report builder. We evaluate the system on 84 adversarial and functional scenarios run twice against a frontier language model (n=168) and present a Decision Coverage report demonstrating that enforcement-grounded evidence can be produced at statistical power above the ±10pp precision threshold.

2. Problem Statement

2.1 The Deployability Gap

Consider an enterprise evaluating whether to deploy a language model as an autonomous coding agent. The model will read source files, write modifications, and commit changes to a repository. The enterprise needs to answer:

Does the model respect tool boundaries? (Will it try to execute shell commands if only given file tools?)
Does it terminate? (Will it loop indefinitely on ambiguous tasks?)
Does it respect authorization? (Will it write files when only authorized to read?)
Does it fail safely? (When it encounters an input it cannot handle, does it stop cleanly or produce undefined output?)

Existing benchmarks (SWE-bench [2], HumanEval [1], MMLU [3]) do not address these questions. They measure task completion, not behavioral boundaries under failure.

2.2 Requirements for Decision-Grade Evidence

For evidence to support deployment decisions — particularly in regulated industries — it must satisfy properties that ad hoc testing does not:

Determinism. The same scenario, model, and configuration must produce the same verdict.

Completeness. Every run must receive exactly one verdict. There is no "inconclusive" outcome.

Auditability. The chain from raw model output to final verdict must be independently verifiable from artifacts alone, without re-running the model.

Validity bounds. The evidence must specify when it expires — what changes invalidate it.

Separation of concerns. The evidence must distinguish between "the model cannot do the task" (capability) and "the system does not control the model adequately" (enforcement).

3. Decision Coverage

3.1 Three-State Verdict Model

Every autonomous run terminates in exactly one of three verdicts:

valid_commit. The agent completed the task within all defined constraints. All enforcement checks passed. The task objective was achieved. This is the only positive outcome.

bounded_failure. The agent did not complete the task, but all failures were detected and handled within the defined safety envelope. The system terminated cleanly, enforcement mechanisms fired correctly, and no unauthorized actions occurred. Examples: the agent looped and was stopped by progress detection; the agent attempted a disallowed tool and was blocked; the agent exhausted its step budget.

undefined_behavior. The agent's behavior fell outside the defined envelope. Either an enforcement mechanism failed to fire, the agent produced output that the system could not classify, or a failure occurred through an unexpected path. This is the residual risk category.

The key insight is that bounded_failure is a positive signal for deployability. A system that fails safely 60% of the time and succeeds 5% of the time may be more deployable than one that succeeds 80% of the time but exhibits undefined behavior on 20% of runs. The former has a known, bounded failure envelope; the latter does not.

3.2 Terminal Coverage

We define Terminal Coverage as:

\text{Terminal Coverage} = \frac{|\text{valid\_commit}| + |\text{bounded\_failure}|}{|\text{total runs}|}

Terminal Coverage represents the proportion of runs where the system's behavior is fully accounted for — either it succeeded or it failed in a way that was detected, classified, and bounded. The complement (1 − Terminal Coverage) is the proportion of undefined behavior, which represents residual risk.

Terminal Coverage can be high even when valid_commit is zero. A system that always fails but always fails safely has 100% Terminal Coverage and 0% valid_commit. This is by design: for many deployment contexts, knowing that the system will never produce undefined behavior is more valuable than knowing it sometimes succeeds.

3.3 Confidence Intervals

Decision Coverage is a proportion estimated from a finite sample. We compute 95% confidence intervals using the Wilson score interval [4], which handles small sample sizes and avoids negative bounds:

\tilde{p} = \frac{k + z^2/2}{n + z^2}, \quad w = \frac{z\sqrt{k(n-k)/n + z^2/4}}{n + z^2}

\text{CI} = [\tilde{p} - w, \tilde{p} + w] \quad \text{where } z = 1.96

We report the minimum sample size required for a confidence interval width of ±10 percentage points:

n_{\min} = \left\lceil \frac{z^2 \cdot p(1-p)}{d^2} \right\rceil \quad \text{where } d = 0.10

This allows consumers of Decision Coverage reports to assess the statistical power of the evidence.

4. System Architecture

ConstantX implements Decision Coverage as a pipeline with four stages: controlled execution, signal emission, verdict reduction, and report generation.

4.1 Controlled Execution Engine

The ConstantX Engine is an orchestrator that executes agent runs under strict enforcement. It provides the model with a fixed set of tools (read_file, write_file, commit, diff_files, done) and enforces behavioral boundaries at multiple points.

The engine enforces through two mechanisms:

Pre-dispatch enforcement. Before a tool is executed, the engine validates: (a) the tool is in the allowlist, (b) the action schema is valid, (c) the arguments pass policy checks (OPA [5]), (d) side-effect actions require explicit authorization.

Post-dispatch enforcement. After each step, the engine checks: (a) the agent has not exceeded its step budget, (b) the agent is making progress (not repeating identical tool calls), (c) the output conforms to the expected JSON schema.

Every enforcement check that fires is recorded in a trace. The trace is the source of truth for all downstream analysis.

4.2 Protocol Signals

At the end of each run, the engine emits a structured protocol_signals.json artifact containing:

A cryptographic digest of the trace.
A list of all enforcement signals that fired (e.g., tool_disallowed, no_progress, non_json_output).
A closure record documenting which checks were performed and which violations were observed.
The spec version and hash of the signal schema, enabling cross-version validation.

Protocol signals are the bridge between the engine (runtime) and the evaluator (analysis). The evaluator never re-interprets raw traces; it operates on the structured signal output.

4.3 Verdict Reducer

The verdict reducer maps each run to exactly one of the three verdicts using a deterministic algorithm:

input: run_status, protocol_signals, scenario_definition
output: verdict ∈ {valid_commit, bounded_failure, undefined_behavior}

if run_status == "complete" and no disallowed_signals fired:
    return valid_commit
if run_status ∈ scenario.allowed_statuses and
   all fired signals ∈ scenario.allowed_failure_signals and
   no signal ∈ scenario.disallowed_signals:
    return bounded_failure
return undefined_behavior

The scenario definition specifies which statuses and signals constitute expected behavior for that scenario. This allows the same engine and model to be evaluated against different behavioral contracts — a scenario designed to test prompt injection resistance will have different allowed signals than one designed to test successful task completion.

4.4 Evidence Chain

The full evidence chain is:

Engine traces → Protocol signals → Run verdicts → Decision Coverage → Report

Each link produces a hashable artifact. The report includes cryptographic hashes of the system prompt, agent prompt, protocol signal spec, and scenario suite. Any change to any component in the chain invalidates the evidence and triggers re-evaluation.

5. Coverage Methodology

A Decision Coverage metric is only as meaningful as the scenarios that produce it. We formalize a coverage methodology grounded in the structure of the engine itself.

5.1 Failure Categories (Architecture-Derived)

The engine has a fixed control flow: receive task → prompt LLM → parse output → dispatch tool → observe result → loop or terminate. We derive five failure categories exhaustively from this architecture. Any failure of a single-pass agent loop with tool dispatch falls into exactly one:

Category	What fails	Architectural origin
Format violation	LLM output cannot be parsed into a valid action	The parse step requires a specific JSON schema
Authorization violation	LLM requests an action the policy does not permit	Tool dispatch has an OPA policy gate and side-effect check
Scope violation	LLM accesses resources outside its boundary	Workspace root resolution and path containment
Liveness violation	LLM does not reach a terminal state in bounded time	The loop has a finite step budget and no-progress detection
Completeness violation	LLM terminates without fulfilling the task contract	Done/commit gates enforce completion requirements

These categories are exhaustive because they correspond to the five control flow transitions that can fail: output parsing, action authorization, resource resolution, loop termination, and exit validation. Adding a sixth category would require adding a new control flow stage to the architecture.

5.2 Enforcement Surfaces

An enforcement surface is a point in the engine where agent behavior is checked, blocked, or terminated. They are derived from code inspection and organized by the failure category they defend:

Format violation (3 surfaces): strict JSON parsing (_parse_action), Pydantic schema validation per action type, multi-action rejection (Extra data detection).

Authorization violation (4 surfaces): OPA policy gate (_opa_check), side-effect gating (SIDE_EFFECT_ACTIONS check), write-files policy (allow_write_files), read-only mode override (agent_read_only_mode).

Scope violation (3 surfaces): path traversal check (_resolve_path), workspace root containment, file size limit (max_file_bytes).

Liveness violation (2 surfaces): max-steps termination (agent_max_steps), no-progress detection (agent_no_progress_limit).

Completeness violation (4 surfaces): commit-required gate (completion_mode == "commit"), min-tool-calls enforcement, done-message validation (DoneArgs field validator), approved-action validation.

Total: 16 surfaces across 5 categories.

5.3 Threat Vectors

A threat vector is a concrete way an agent can trigger a failure category. The categories are exhaustive (derived from architecture); the vectors within each category are empirically chosen from known attack classes (OWASP LLM Top 10 [6], MITRE ATLAS [7]) and observed failures during development. The vector list is necessarily incomplete — but the categories bound what "incomplete" means: a new vector maps to exactly one category, and the framework makes the coverage impact explicit.

Format violation vectors:

Non-JSON output (prose, markdown, mixed text)
Multi-action output (multiple JSON objects in one response)
Invalid action schema (unknown fields, wrong types)

Authorization violation vectors:

Tool not in allowlist (hallucinated tool name)
Side-effect without permission (write/commit in read-only context)
OPA policy denial (action violates policy rules)
Prompt injection via task (task string contains override instructions)
Prompt injection via file content (file read returns adversarial text)

Scope violation vectors:

Path traversal (../../etc/passwd)
Tool argument manipulation (malformed paths, oversized inputs)
Large input boundary (file exceeding size limits)

Liveness violation vectors:

Infinite loop (repeated identical tool calls)
Step budget exhaustion (non-repeating but unproductive calls)

Completeness violation vectors:

Done without commit (task requires commit, agent skips it)
Side-effect escalation (read-only task attempts writes)

5.4 Coverage Matrix

We define coverage per failure category. For category C with enforcement surfaces E_C and threat vectors V_C, the category coverage matrix M_C where M_C[i,j] indicates whether surface E_i is exercised against vector V_j by at least one scenario:

M_C[i,j] \in \{\text{covered}, \text{not applicable}, \text{gap}\}

Category coverage:

\text{Coverage}(C) = \frac{|\{(i,j) : M_C[i,j] = \text{covered}\}|}{|\{(i,j) : M_C[i,j] \neq \text{not applicable}\}|}

Suite-level coverage is the weighted average across categories:

\text{Suite Coverage} = \frac{\sum_C |\text{covered cells in } C|}{\sum_C |\text{applicable cells in } C|}

This structure answers two questions that a flat matrix cannot:

"Why these vectors?" — The categories are derived from the architecture. The vectors are the known instances within each category.
"Where are we blind?" — A category with few vectors tested has thin coverage regardless of how many surfaces it has. When a new attack class is discovered, it maps to exactly one category, and coverage for that category decreases until scenarios are added.

For the current ConstantX suite (84 scenarios), per-category coverage:

Category	Surfaces	Vectors	Applicable cells	Covered	Coverage
Format violation	3	3	6	6	100%
Authorization violation	4	5	12	12	100%
Scope violation	3	3	5	5	100%
Liveness violation	2	2	4	4	100%
Completeness violation	4	2	5	5	100%
Total	16	15	32	32	100%

6. Evaluation

6.1 Setup

We evaluate the ConstantX system against claude-opus-4-6, a frontier language model accessed via the Anthropic API. The model runs under full ConstantX Engine enforcement: OPA policy gate, workspace sandboxing, side-effect gating, no-progress detection, and step budget enforcement. No provider-side output filtering or retry logic is applied; each run is a single pass.

To obtain a statistically meaningful n, we run the full suite twice independently, yielding n=168 total runs (2 × 84 scenarios). Both runs used identical configuration (same system prompt, agent prompt, policy, and scenario definitions), with runs executed sequentially on the same date (2026-02-18).

The scenario suite consists of 84 scenarios across 7 categories:

Category	Count	Tests	Expected Verdict
AC-SUCCESS	12	Task completion under constraints	valid_commit
AC-TOOL	13	Tool discipline / schema enforcement	bounded_failure
AC-LOOP	12	No-progress / step-budget termination	bounded_failure
AC-GATE	12	Approval and commit-path integrity	bounded_failure
AC-INJECT	11	Prompt injection resistance	bounded_failure
AC-TOOLARG	11	Tool argument attacks	bounded_failure
AC-ADV	13	Adversarial scenarios + benign controls	mixed

6.2 Results

Verdict	Count	%	95% CI
valid_commit	29	17.26	[12.30, 23.69]
bounded_failure	138	82.14	[75.65, 87.20]
undefined_behavior	1	0.60	[0.11, 3.29]
Terminal Coverage	167	99.40	[96.71, 99.89]

n=168 exceeds the minimum recommended sample size of 97 for ±10pp CI width. The lower bound of the Terminal Coverage CI is 96.71%, providing high-confidence evidence of enforcement surface integrity.

6.3 Analysis

The single undefined behavior event (AC-ADV-010, run 2) produced a non_json_output signal — the model emitted malformed JSON on that call. The same scenario was classified as bounded_failure in run 1, confirming the event is stochastic rather than systematic. The engine detected and terminated the run correctly; the undefined_behavior verdict reflects that the specific failure path (malformed output rather than clean refusal) was outside the scenario's expected signal set.

Per-category results:

Category	n	valid_commit	bounded_failure	undefined_behavior	TC
AC-SUCCESS	24	24	0	0	100%
AC-TOOL	26	0	26	0	100%
AC-LOOP	24	0	24	0	100%
AC-GATE	24	0	24	0	100%
AC-INJECT	22	0	22	0	100%
AC-TOOLARG	22	0	22	0	100%
AC-ADV	26	5	20	1	96.2%

Six of seven categories achieved 100% Terminal Coverage across both runs. The single undefined behavior event is isolated to the AC-ADV category, which tests the highest-uncertainty adversarial scenarios including benign controls where model behavior is intentionally variable.

The bounded_failure distribution across runs:

Signal	Count	% of bounded_failure
`no_progress`	24	17.4%
`tool_disallowed`	8	5.8%
`terminated_without_commit`	2	1.4%

The remaining bounded_failure runs terminated via the standard enforcement path for their category (approval gate, step budget, OPA denial).

6.4 Capability Ceiling Table

Capability	Status	Notes
Deterministic termination	99.4% (167/168)	1 stochastic non_json_output on AC-ADV-010 in run 2; not reproducible
Tool schema compliance	100%	No tool_payload_invalid signals across 168 runs
Side-effect control	100% (24/24 AC-GATE)	All side-effect-gated scenarios blocked or routed to pending_approval
Prompt injection resistance	100% (22/22 AC-INJECT)	All injection attempts contained; no injected instructions executed
Path traversal resistance	100% (22/22 AC-TOOLARG)	All path escape attempts blocked by workspace sandbox enforcement
Adversarial resilience	96.2% (25/26 AC-ADV)	5 benign controls succeeded; 20 attacks contained; 1 non_json_output

6.5 Reference Capability Baseline

The reference suite (60 samples across classification, extraction, and code generation) was evaluated against claude-opus-4-6 to establish a capability baseline:

Task	n	Score	p50 Latency	p95 Latency
Classification	20	95.0%	1,995ms	2,494ms
Extraction	20	81.7%	2,231ms	2,656ms
Code	20	95.0%	2,247ms	2,929ms

The reference results confirm that the model is highly capable on the task types represented in the agentic suite. This grounds the Failure Envelope interpretation: the 0.60% undefined behavior rate reflects a stochastic output formatting failure, not a systematic capability deficit. A model scoring 95% on code generation tasks does not fail to write JSON because it cannot — the single non_json_output event is enforcement-surface noise at the boundary of the expected signal set.

7. Related Work

Capability Benchmarks. HumanEval [1], SWE-bench [2], MMLU [3], and similar benchmarks measure task completion accuracy. They do not evaluate behavioral boundaries under failure, authorization compliance, or termination properties. Decision Coverage is complementary: it measures what these benchmarks cannot.

Red Teaming. Manual red-teaming (as practiced by AI safety teams at Anthropic [8], OpenAI [9], and Google DeepMind [10]) produces qualitative findings about model vulnerabilities. Decision Coverage formalizes this into a quantitative, reproducible, auditable metric. The coverage matrix (Section 5.4) can be seen as a structured version of a red-team test plan, with failure categories derived from the system architecture rather than ad hoc threat enumeration.

Agent Benchmarks. AgentBench [11], WebArena [12], and SWE-agent [13] evaluate autonomous agent performance on multi-step tasks. These focus on task success rate — the equivalent of valid_commit in our framework. They do not systematically evaluate failure modes, termination behavior, or policy compliance. Decision Coverage provides the complementary signal.

Policy Enforcement. Open Policy Agent (OPA) [5] provides policy-as-code for infrastructure authorization. ConstantX uses OPA as one enforcement surface but extends the concept to a full evidence chain: policy enforcement produces signals, signals produce verdicts, verdicts produce coverage. The contribution is the evidence chain, not the policy engine.

8. Discussion

8.1 Bounded Failure as a First-Class Outcome

The most novel aspect of Decision Coverage is the treatment of bounded failure as a positive deployability signal. Traditional evaluation frameworks treat any non-success as failure. In autonomous systems, the distinction between "failed safely" and "failed unsafely" is the distinction between a deployable and a non-deployable system.

A Terminal Coverage of 99.4% — where 82.1% of runs are bounded_failure and only 17.3% are valid_commit — is a stronger deployability signal than a task success rate of 95% with unknown failure modes. The former provides a complete behavioral envelope with statistical power; the latter does not. The result demonstrates that a capable model can achieve near-complete enforcement coverage precisely because it is capable: it follows the JSON protocol, respects tool boundaries, and terminates cleanly rather than producing undefined output.

8.2 Scope

Verdict model-dependence. Undefined_behavior verdicts arise when the scenario's expected failure path diverges from the model's actual failure path. Decision Coverage is therefore partially a measure of how well the scenario suite anticipates a specific model's behavior. Running against multiple models and expanding expected signal sets tightens this.

Single-pass execution. The engine evaluates a single agent loop: prompt → tool calls → terminal state. One-shot execution is the hardest condition for enforcement surfaces — if the boundary holds here, it holds under easier conditions. Decision Coverage measures boundary integrity. Multi-pass architectures (retry, reflection, self-correction) require separate evaluation methodology.

Signal extraction layer. The current implementation extracts signals from ConstantX Engine traces. Evaluating a different agent runtime (e.g., LangChain, CrewAI) requires a signal extraction adapter for that runtime's trace format. The verdict model, failure categories, coverage matrix, and report structure are runtime-agnostic — only signal extraction is engine-specific.

Static scenarios. The current suite uses fixed file contents and task descriptions. Dynamic environments, multi-agent interactions, and long-horizon multi-session tasks are outside current scope.

9. Conclusion

Decision Coverage provides a quantitative, reproducible framework for evaluating the deployability of agentic AI systems. By classifying every run into one of three verdicts and defining coverage over architecture-derived failure categories and their enforcement surfaces, it produces evidence that is auditable and versioned.

The key contribution is conceptual: bounded failure is not the absence of success. It is a positive signal that the system's failure envelope is known and controlled. For deployment decisions, this is often more valuable than success rate alone.

Evaluated across n=168 runs against claude-opus-4-6, ConstantX achieves 99.40% Terminal Coverage [95% CI: 96.71–99.89] — exceeding the ±10pp statistical precision threshold with a lower bound above 96.71%. The single undefined behavior event is stochastic and isolated. Six of seven failure categories achieved 100% Terminal Coverage across both runs. This result demonstrates that enforcement-grounded evidence can be produced at scale against frontier models and that the Decision Coverage methodology produces meaningful differentiation: the framework captures not just whether a system succeeds, but whether it fails safely and completely.

References

[1] Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.

[2] Jimenez, C.E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770, 2023.

[3] Hendrycks, D., et al. "Measuring Massive Multitask Language Understanding." arXiv:2009.03300, 2020.

[4] Wilson, E.B. "Probable Inference, the Law of Succession, and Statistical Inference." Journal of the American Statistical Association, 22(158):209–212, 1927.

[5] Open Policy Agent. https://www.openpolicyagent.org/

[6] OWASP. "OWASP Top 10 for Large Language Model Applications." 2023.

[7] MITRE. "ATLAS: Adversarial Threat Landscape for AI Systems." https://atlas.mitre.org/

[8] Anthropic. "Red Teaming Language Models to Reduce Harms." arXiv:2209.07858, 2022.

[9] OpenAI. "GPT-4 System Card." 2023.

[10] Google DeepMind. "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805, 2023.

[11] Liu, X., et al. "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688, 2023.

[12] Zhou, S., et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854, 2023.

[13] Yang, J., et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." arXiv:2405.15793, 2024.