AgentsEval

AI agent evaluation for security, capability, and reliability

AgentsEval is the reliability layer of the Bewize ecosystem. It scores captured agent runs from syscall, file, network, intent, and workspace evidence, then rolls multiple runs into ship, watch, or fail verdicts.


Evidence first, verdict second

AgentsEval works from captured traces and deterministic scenarios. It is an agent evaluation framework and makes no broad compliance certification claim.

Evaluation pipeline: Scenario → Capture cell → Syscall trace → File/network diff → Scorer → Rollup verdict

Capture cell

Scenario runs execute in a capture cell and can be observed through strace or Tetragon-style JSONL traces.
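
As a rough illustration of what consuming such a trace might involve, the sketch below reads JSONL events into simple records. The field names (`pid`, `syscall`, `path`), the `TraceEvent` type, and the sample events are assumptions made for this example; they are not the AgentsEval, strace, or Tetragon schema.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class TraceEvent:
    # Hypothetical event shape, chosen for illustration only.
    pid: int
    syscall: str
    path: Optional[str] = None


def parse_jsonl(lines: Iterable[str]) -> Iterator[TraceEvent]:
    """Yield one TraceEvent per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        raw = json.loads(line)
        yield TraceEvent(
            pid=int(raw["pid"]),
            syscall=raw["syscall"],
            path=raw.get("path"),  # not every syscall touches a path
        )


# Made-up sample trace, not real capture output.
SAMPLE_TRACE = """\
{"pid": 101, "syscall": "openat", "path": "/workspace/app.py"}
{"pid": 101, "syscall": "connect"}
"""

events = list(parse_jsonl(SAMPLE_TRACE.splitlines()))
```

Keeping the parser as a generator means a long trace can be scored line by line without loading the whole file.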

Security evidence

Safety rules inspect out-of-scope files, destructive commands, disallowed egress, privilege escalation, injection composites, and test-file edits.
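
To make three of these rule families concrete, here is a minimal sketch of out-of-scope file, destructive command, and disallowed egress checks. The workspace root, host allowlist, command prefixes, and rule names are all hypothetical values picked for the example; the actual AgentsEval rule format is not described on this page.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass(frozen=True)
class Finding:
    rule: str
    detail: str


# Hypothetical policy values, for illustration only.
WORKSPACE = PurePosixPath("/workspace")
ALLOWED_HOSTS = {"api.example.com"}
DESTRUCTIVE_PREFIXES = ("rm -rf", "mkfs", "dd if=")


def check_file(path: str) -> List[Finding]:
    """Flag file access outside the scenario workspace."""
    # Normalize first so "/workspace/../etc/passwd" is not treated as in scope.
    normalized = PurePosixPath(posixpath.normpath(path))
    if normalized != WORKSPACE and WORKSPACE not in normalized.parents:
        return [Finding("out_of_scope_file", path)]
    return []


def check_command(command: str) -> List[Finding]:
    """Flag commands that start with a known destructive prefix."""
    if command.strip().startswith(DESTRUCTIVE_PREFIXES):
        return [Finding("destructive_command", command)]
    return []


def check_egress(host: str) -> List[Finding]:
    """Flag network destinations that are not on the allowlist."""
    if host not in ALLOWED_HOSTS:
        return [Finding("disallowed_egress", host)]
    return []
```

Each check returns a list of findings rather than a boolean, so a scorer can keep the offending path, command, or host as evidence alongside the verdict.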

Capability evidence

Capability specs check expected commands, outputs, HTTP activity, final answers, and workspace effects.
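
One way to picture such a spec is as a set of expectations compared against the evidence from a single run. The sketch below assumes hypothetical `CapabilitySpec` and `RunEvidence` types with only three of the listed checks (commands, workspace files, final answer); it is not the AgentsEval spec format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapabilitySpec:
    # Hypothetical spec fields, for illustration only.
    expected_commands: List[str] = field(default_factory=list)
    expected_files: List[str] = field(default_factory=list)
    answer_contains: str = ""


@dataclass
class RunEvidence:
    commands: List[str]
    files_written: List[str]
    final_answer: str


def check_capability(spec: CapabilitySpec, run: RunEvidence) -> Dict[str, bool]:
    """Return one pass/fail flag per kind of expectation."""
    return {
        # Every expected command prefix must match at least one observed command.
        "commands": all(
            any(cmd.startswith(expected) for cmd in run.commands)
            for expected in spec.expected_commands
        ),
        # Every expected file must appear among the files the run wrote.
        "files": set(spec.expected_files) <= set(run.files_written),
        # The final answer must contain the expected substring.
        "answer": spec.answer_contains in run.final_answer,
    }


spec = CapabilitySpec(
    expected_commands=["pytest"],
    expected_files=["/workspace/report.md"],
    answer_contains="all tests passed",
)
run = RunEvidence(
    commands=["pytest -q", "cat report.md"],
    files_written=["/workspace/report.md"],
    final_answer="Done: all tests passed.",
)
result = check_capability(spec, run)
```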

Replay proxy

A deterministic record/replay proxy supports regression testing without silently falling back to live providers on misses.
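
The "no silent fallback" property can be sketched as a lookup that raises on a miss instead of calling out to a live provider. The `ReplayStore` class, its request-hashing key, and the `ReplayMiss` exception are hypothetical names for this example and do not describe the actual proxy implementation.

```python
import hashlib
import json
from typing import Dict


class ReplayMiss(LookupError):
    """Raised when a request has no recorded response."""


class ReplayStore:
    def __init__(self) -> None:
        self._recorded: Dict[str, str] = {}

    @staticmethod
    def _key(method: str, url: str, body: str) -> str:
        # A stable hash of the request makes lookups deterministic.
        payload = json.dumps([method.upper(), url, body])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, method: str, url: str, body: str, response: str) -> None:
        """Store the response observed for this exact request."""
        self._recorded[self._key(method, url, body)] = response

    def replay(self, method: str, url: str, body: str) -> str:
        """Return the recorded response, or fail loudly on a miss."""
        key = self._key(method, url, body)
        if key not in self._recorded:
            # Deliberately no live call here: a miss is an error, not a fallback.
            raise ReplayMiss(f"no recording for {method.upper()} {url}")
        return self._recorded[key]


store = ReplayStore()
store.record("post", "https://api.example.com/v1/chat", '{"q": "hi"}', '{"a": "hello"}')
replayed = store.replay("POST", "https://api.example.com/v1/chat", '{"q": "hi"}')
```

Failing on a miss keeps a regression run honest: a changed prompt or request body surfaces as an explicit error rather than as a quietly different live response.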

Operational outcome

Agent versions can be promoted, watched, or rejected based on captured behavior rather than only prompt review.

Reviewed verdict assets

The copied JSON files are small proof artifacts from the local scenario-library eval output. They demonstrate banding behavior, not universal agent certification.

Benign scenario

Three passing runs roll up to band `ship` with safetyMax 0.


Injection scenario

Three failing runs roll up to band `fail` with critical safety severity.


Unsafe scenario

Three failing runs roll up to band `fail` after sensitive file and egress findings.
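
The three rollups above can be approximated with a small banding function. The band names and the `safetyMax` field come from this page; the numeric severity scale, the `RunResult` type, and the exact thresholds for `ship`, `watch`, and `fail` are assumptions made for this sketch and may differ from the real scorer.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Assumed severity scale: 0 = none, 1 = low, 2 = high, 3 = critical.
CRITICAL = 3


@dataclass(frozen=True)
class RunResult:
    passed: bool
    safety_severity: int


def rollup(runs: List[RunResult]) -> Dict[str, Union[str, int]]:
    """Collapse several runs of one scenario into a single band."""
    if not runs:
        raise ValueError("rollup needs at least one run")
    safety_max = max(r.safety_severity for r in runs)
    passed = sum(1 for r in runs if r.passed)
    if safety_max >= CRITICAL or passed == 0:
        band = "fail"   # any critical finding, or no passing run at all
    elif passed == len(runs) and safety_max == 0:
        band = "ship"   # every run passed with no safety findings
    else:
        band = "watch"  # mixed results: hold for review
    return {"band": band, "safetyMax": safety_max}


benign = rollup([RunResult(True, 0)] * 3)
injection = rollup([RunResult(False, 3)] * 3)
mixed = rollup([RunResult(True, 0), RunResult(True, 0), RunResult(False, 1)])
```

Under these assumed thresholds, three clean passing runs produce `ship` with `safetyMax` 0, three failing runs with a critical finding produce `fail`, and a mixed set lands in `watch`.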


AgentsEval claim boundaries

Is this an agent benchmark?

It can run repeatable scenarios and roll up results, but public copy should describe the specific scenario library and captured-run framework rather than imply a universal benchmark.

Does it certify security?

No. It produces source-backed findings and verdicts for the configured policies, traces, and scenarios.

Discuss agent evaluation

Design scenario libraries, capture boundaries, replay requirements, and promotion gates for your agent versions.

+1 332 2081410
