Benchmarks
Agent-task benchmark
An agent answers "How does tokio schedule and run async tasks?" on the Tokio codebase, with and without each tool. Each run is scored on efficiency (tokens, cost, tool calls, wall time) and on blind-judged answer quality. oxcode and codegraph were measured on different agent harnesses, so the comparable unit is each tool's improvement vs. its own no-tool baseline, not absolute numbers.
| arm | answer quality | tokens | cost | tool calls | wall time |
|---|---|---|---|---|---|
| baseline (no tool) | 0.98 | — | — | — | — |
| oxcode — codex/gpt-5.5, CLI, n=6 | 0.96 (tied) | +15% | +4% | −4% | +14% |
| oxcode — codex/gpt-5.5, MCP, n=6 | 0.93 | −74% | −57% | −84% | −60% |
| codegraph — Opus 4.8, MCP, published | not measured | −38% | even | −57% | −18% |
Percentages are the change relative to each tool's own no-tool baseline (negative means a reduction, which is better). Answer quality is the blind LLM-judge score on a 0–1 scale. All oxcode rows come from one n=6 release suite on Tokio.
Absolute medians for the oxcode arms, each listed as baseline → CLI → MCP: tokens 395k → 455k → 104k; cost $0.17 → $0.18 → $0.07; tool calls 28 → 27 → 5; wall time 97s → 111s → 39s.
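As a rough cross-check, the percentages in the table can be recomputed from the absolute medians above. The sketch below is illustrative only: the dictionary layout and function names are made up for this example and are not part of oxcode. Because the medians quoted here are already rounded, the recomputed values match the table only to within about two percentage points (cost and tool calls differ slightly), presumably because the table was derived from unrounded medians.

```python
# Absolute medians as quoted in the text (already rounded).
# The structure and names below are illustrative, not an oxcode API.
MEDIANS = {
    "tokens_k":   {"baseline": 395,  "cli": 455,  "mcp": 104},
    "cost_usd":   {"baseline": 0.17, "cli": 0.18, "mcp": 0.07},
    "tool_calls": {"baseline": 28,   "cli": 27,   "mcp": 5},
    "wall_s":     {"baseline": 97,   "cli": 111,  "mcp": 39},
}


def relative_change(value: float, baseline: float) -> float:
    """Percent change vs. the baseline; negative means a reduction."""
    return (value - baseline) / baseline * 100.0


def change_table(medians: dict) -> dict:
    """Map each metric to the rounded percent change of every non-baseline arm."""
    return {
        metric: {
            arm: round(relative_change(value, row["baseline"]))
            for arm, value in row.items()
            if arm != "baseline"
        }
        for metric, row in medians.items()
    }


if __name__ == "__main__":
    for metric, changes in change_table(MEDIANS).items():
        print(metric, changes)
```

The same formula applies to the codegraph row, but only its published percentages are available here, so it cannot be recomputed from absolute values.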
The MCP server is the headline
Delivering the same bounded, PageRank-curated context through a one-call `oxcode_explore` MCP tool, instead of a CLI the agent composes, cuts tool calls by 84%, tokens by 74%, cost by 57%, and wall time by 60% vs. the no-tool baseline, exceeding codegraph's published reductions (−57% tool calls / −38% tokens).
The CLI arm is statistically tied with the baseline: the agent treats a shell binary as a supplement to its own grep/read, not a replacement, so the gap was always tool delivery, not index quality. The quality gate exposes one cost that a quality-blind benchmark would hide: MCP answer quality dips to 0.93 from the baseline's 0.98, a completeness trade-off from the leaner exploration. The codegraph numbers are from its README, re-validated 2026-06-02.
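To make "one-call" concrete, here is a minimal sketch of the single request an MCP client might send to invoke the tool. The `tools/call` method and JSON-RPC 2.0 envelope follow the Model Context Protocol; the `query` argument name is an assumption for illustration, since the tool's actual input schema is not described here, and `build_explore_request` is a hypothetical helper.

```python
import json


def build_explore_request(question: str, request_id: int = 1) -> str:
    """Serialize a single MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "oxcode_explore",
            # Hypothetical argument name; the real input schema may differ.
            "arguments": {"query": question},
        },
    }
    return json.dumps(request)


if __name__ == "__main__":
    print(build_explore_request("How does tokio schedule and run async tasks?"))
```

In the CLI arm, by contrast, the agent composes shell invocations itself and interleaves them with its own grep/read calls, which is consistent with the tool-call counts staying near the baseline.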
The engine-level numbers behind these results — incremental reindex, query latency, cold index — live in oxgraph Benchmarks.