Benchmark / June 4, 2026

Agent Command Safety Benchmark: 1,200 Decisions Measured

How Termyte measures deterministic agent command safety without presenting a fixture suite as proof of complete safety.

A balanced fixture

The checked-in governance suite contains 1,200 unique actions: 400 expected allow, 400 expected warn, and 400 expected block.

termyte bench
termyte bench --json

The current checked-in result is all 1,200 decisions correct, with zero false-safe results and zero overblocks. Per-decision precision and recall are reported alongside a confusion matrix and category coverage.
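To make the reported metrics concrete, here is a minimal sketch of how a three-way confusion matrix and per-decision precision and recall can be computed from (expected, actual) pairs. It is not Termyte's implementation. The function names and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter

# The three decision labels used by the fixture suite.
DECISIONS = ("allow", "warn", "block")


def confusion_matrix(pairs):
    """Count (expected, actual) pairs into matrix[expected][actual]."""
    counts = Counter(pairs)
    return {
        expected: {actual: counts[(expected, actual)] for actual in DECISIONS}
        for expected in DECISIONS
    }


def precision_recall(matrix, decision):
    """Return (precision, recall) for one decision label."""
    true_positive = matrix[decision][decision]
    # Everything the checker decided as `decision`, whatever the label was.
    predicted = sum(matrix[expected][decision] for expected in DECISIONS)
    # Everything labeled `decision`, whatever the checker decided.
    labeled = sum(matrix[decision].values())
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / labeled if labeled else 0.0
    return precision, recall


# Hypothetical toy data, NOT benchmark output: one "warn" fixture was allowed.
pairs = (
    [("allow", "allow")] * 4
    + [("warn", "warn")] * 3
    + [("warn", "allow")]
    + [("block", "block")] * 4
)
matrix = confusion_matrix(pairs)
for decision in DECISIONS:
    precision, recall = precision_recall(matrix, decision)
    print(f"{decision}: precision={precision:.2f} recall={recall:.2f}")
```

In the toy data the single misjudged fixture lowers precision for allow and recall for warn, which is why both numbers are worth reading next to the matrix rather than relying on a single accuracy figure.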

What it validates

The suite measures the stable, non-executing policy/check path against labeled fixtures. It covers read-only actions, tests, publishing, destructive Git operations, secret access, destructive SQL, and broad filesystem deletion.

What it cannot prove

It does not prove complete command coverage, sandbox isolation, guaranteed interception, or governance of commands that bypass Termyte. Re-run the benchmark against the installed version instead of treating a checked-in result as a permanent guarantee.
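One way to re-run the benchmark from a script, for example in CI, is sketched below. It assumes only that `termyte bench --json` writes a single JSON document to standard output. The shape of that document is not described here, so the sketch parses it without reading any specific field. The helper name `run_bench_json` is hypothetical.

```python
import json
import subprocess


def run_bench_json(command):
    """Run a benchmark command and parse its standard output as JSON.

    Assumption: the command prints exactly one JSON document to stdout.
    Returns (exit_code, parsed_document) and leaves interpretation of both
    to the caller, because the report schema is not assumed here.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as text rather than bytes
        check=False,          # do not raise on a non-zero exit code
    )
    return completed.returncode, json.loads(completed.stdout)
```

Calling `run_bench_json(["termyte", "bench", "--json"])` on the machine where the agent actually runs exercises the installed version rather than the checked-in result.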