Benchmarks

Every verdict carries its evidence

This is a dataset, not a leaderboard. Two different things are measured below, and they should be read separately.

Finding (varies)

What was actually observed about the tool's data flow: the risk axis. Values include clean, functional egress, disclosed telemetry, and undisclosed telemetry. This is the asset.

Integrity (meta)

How well-evidenced this verdict is. Published verdicts score high by selection — we only publish what we can fully evidence. When we can't, we hold it.

| Tool | Finding | Jurisdiction | Evidence | Verdict integrity |
|---|---|---|---|---|
| @felores/airtable-mcp-server | Functional egress | tier 2 | 12 request(s) captured | 100/100 |
| @forestadmin/mcp-server | Functional egress | | 2 request(s) captured | 100/100 |
| @mobilenext/mobile-mcp | Telemetry (disclosed) | tier 2 | 21 request(s) captured | 100/100 |
| @notionhq/notion-mcp-server | Functional egress | tier 2 | 22 request(s) captured | 100/100 |
| @openbnb/mcp-server-airbnb | Functional egress | tier 2 | 5 request(s) captured | 100/100 |
| @roychri/mcp-server-asana | Functional egress | tier 2 | 38 request(s) captured | 100/100 |
| @sentry/mcp-server | Telemetry (disclosed) | tier 2 | 2 request(s) captured | 100/100 |
| @ui5/mcp-server | Functional egress | tier 2 | 6 request(s) captured | 100/100 |
| @upstash/context7-mcp | Functional egress | tier 2 | 2 request(s) captured | 100/100 |
| @winor30/mcp-server-datadog | Functional egress | tier 2 | 18 request(s) captured | 100/100 |
| @yoda.digital/gitlab-mcp-server | Functional egress | tier 2 | 32 request(s) captured | 100/100 |
| duckduckgo-mcp-server | Functional egress | tier 2 | 1 request(s) captured | 100/100 |
| hyperbrowser-mcp | Functional egress | tier 2 | 8 request(s) captured | 100/100 |
| linear-mcp-server | Functional egress | tier 2 | 5 request(s) captured | 100/100 |
| mcp-yahoo-finance | Functional egress | tier 2 | 31 request(s) captured | 100/100 |
| tavily-mcp | Functional + undocumented headers | tier 2 | 5 request(s) captured | 100/100 |
| @modelcontextprotocol/server-everything | No external egress | | verified absent | 100/100 |
| @modelcontextprotocol/server-filesystem | No external egress | | verified absent | 100/100 |
| @modelcontextprotocol/server-memory | No external egress | | verified absent | 100/100 |
| @modelcontextprotocol/server-sequential-thinking | No external egress | | verified absent | 100/100 |
| chroma-mcp | No external egress | | verified absent | 100/100 |
| mcp-obsidian | No external egress | | verified absent | 100/100 |
| mcp-server-git | No external egress | | verified absent | 100/100 |
| mcp-server-time | No external egress | | verified absent | 100/100 |
| playwright-mcp-server | No external egress | | verified absent | 100/100 |
| A typical AI audit | unevaluated | | model assertion only | 0/100 |
| A finding we could not fully capture | held for review | | claim not intercepted | ~20/100 · not published |

Read the Finding column for differentiation between tools; read Verdict integrity as a trust stamp on the verdict itself. A confidence-only audit scores zero on integrity; a claim we cannot intercept is held at a low score rather than published. As coverage grows, harder and partially-evidenced cases will widen the published integrity range honestly, never by tuning the rubric.
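For readers who want to work with the table programmatically, here is a minimal sketch of how the two axes might be kept apart in code. The `Verdict` class, its field names, and the helper functions are hypothetical and not part of any published schema; the rows are a small sample copied from the table, and the held row's `~20/100` is approximated as the integer 20 purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One table row. Field names are illustrative assumptions, not a published schema."""
    tool: str
    finding: str            # risk axis: what was observed about data flow
    evidence: str           # what backs the verdict
    integrity: int          # meta axis: how well-evidenced the verdict is, 0 to 100
    published: bool = True  # held findings are not published


# A small sample of rows copied from the table above.
ROWS = [
    Verdict("@notionhq/notion-mcp-server", "Functional egress", "22 request(s) captured", 100),
    Verdict("duckduckgo-mcp-server", "Functional egress", "1 request(s) captured", 100),
    Verdict("tavily-mcp", "Functional + undocumented headers", "5 request(s) captured", 100),
    Verdict("mcp-server-git", "No external egress", "verified absent", 100),
    # "~20/100" in the table; 20 is an approximation used only for this sketch.
    Verdict("A finding we could not fully capture", "held for review",
            "claim not intercepted", 20, published=False),
]


def published_findings(rows):
    """Risk axis only: map each published tool to its finding."""
    return {v.tool: v.finding for v in rows if v.published}


def published_integrity_range(rows):
    """Meta axis only: (min, max) integrity across published verdicts."""
    scores = [v.integrity for v in rows if v.published]
    return min(scores), max(scores)


if __name__ == "__main__":
    for tool, finding in published_findings(ROWS).items():
        print(f"{tool}: {finding}")
    print("published integrity range:", published_integrity_range(ROWS))
```

Comparing tools means reading only `finding`; `integrity` is a property of the verdict, which is why the held row never appears in either helper's result.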