Benchmarks

Agent-task benchmark

An agent answers "How does tokio schedule and run async tasks?" on the Tokio codebase, once with and once without each tool. Each run is scored on efficiency (tokens, cost, tool calls, wall time) and on blind-judged answer quality. oxcode and codegraph were measured on different agent harnesses, so the comparable unit is each tool's improvement vs. its own no-tool baseline, not absolute numbers.

arm	answer quality	tokens	cost	tool calls	wall time
baseline (no tool)	0.98	—	—	—	—
oxcode — codex/gpt-5.5, CLI, n=6	0.96 (tied)	+15%	+4%	−4%	+14%
oxcode — codex/gpt-5.5, MCP, n=6	0.93	−74%	−57%	−84%	−60%
codegraph — Opus 4.8, MCP, published	not measured	−38%	even	−57%	−18%

Percentages are change vs. that tool's own no-tool baseline (negative = reduction, better; quality is the blind LLM-judge score, 0–1). All oxcode rows come from one n=6 release suite on Tokio.

Absolute medians for the oxcode arms (baseline → CLI → MCP): tokens 395k → 455k → 104k; cost $0.17 → $0.18 → $0.07; tool calls 28 → 27 → 5; wall time 97s → 111s → 39s.
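As a rough illustration of how the table's percentages relate to these medians, the sketch below recomputes change vs. baseline from the rounded figures quoted above. The helper name `pct_change` and the dictionary layout are illustrative assumptions, not part of the benchmark harness. Because the inputs here are already rounded, some recomputed values may differ by a point or two from the table, which was presumably derived from unrounded data.

```python
def pct_change(baseline: float, arm: float) -> float:
    """Relative change of an arm vs. its own baseline, in percent.

    Negative values are reductions (better for every efficiency metric).
    """
    return (arm - baseline) / baseline * 100.0


# Rounded absolute medians quoted in the text: (baseline, CLI, MCP).
# These are the published rounded figures, not the raw per-run data.
medians = {
    "tokens (k)": (395, 455, 104),
    "cost ($)": (0.17, 0.18, 0.07),
    "tool calls": (28, 27, 5),
    "wall time (s)": (97, 111, 39),
}

for metric, (base, cli, mcp) in medians.items():
    print(
        f"{metric}: CLI {pct_change(base, cli):+.0f}%, "
        f"MCP {pct_change(base, mcp):+.0f}%"
    )
```

The same function applies to any arm, as long as it is compared against the baseline measured on its own harness.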

The MCP server is the headline

Delivering the same bounded, PageRank-curated context through a one-call oxcode_explore MCP tool, instead of a CLI the agent composes, cuts tool calls 84%, tokens 74%, cost 57%, and wall time 60% vs. the no-tool baseline. These relative reductions exceed codegraph's published ones (−57% tool calls, −38% tokens), with the caveat that codegraph was measured on a different harness.

The CLI arm is statistically tied with the baseline: the agent treats a shell binary as a supplement to its own grep/read, not a replacement. The gap was therefore always tool delivery, not index quality. The one cost the quality gate exposes (and a quality-blind benchmark would hide) is that MCP answer quality dips to 0.93 from the baseline's 0.98, a completeness trade-off from the leaner exploration. codegraph numbers are from its README, re-validated 2026-06-02.

The engine-level numbers behind these results — incremental reindex, query latency, cold index — live in oxgraph Benchmarks.