Benchmarks

Agent-task benchmark

An agent answers "How does tokio schedule and run async tasks?" on the Tokio codebase, with and without each tool. Each run is scored on efficiency (tokens, cost, tool calls, wall time) and on blind-judged answer quality. oxcode and codegraph were measured on different agent harnesses, so the comparable unit is each tool's improvement vs. its own no-tool baseline, not absolute numbers.

| arm | answer quality | tokens | cost | tool calls | wall time |
|---|---|---|---|---|---|
| baseline (no tool) | 0.98 | | | | |
| oxcode (codex/gpt-5.5, CLI, n=6) | 0.96 (tied) | +15% | +4% | −4% | +14% |
| oxcode (codex/gpt-5.5, MCP, n=6) | 0.93 | −74% | −57% | −84% | −60% |
| codegraph (Opus 4.8, MCP, published) | not measured | −38% | even | −57% | −18% |

Percentages are change vs. that tool's own no-tool baseline (negative = reduction, better; quality is the blind LLM-judge score, 0–1). All oxcode rows come from one n=6 release suite on Tokio.

Absolute medians: tokens 395k (baseline) → 455k (CLI) → 104k (MCP); cost $0.17 → $0.18 → $0.07; tool calls 28 → 27 → 5; wall 97s → 111s → 39s.
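
The percentage columns can be approximately recomputed from these medians. The sketch below is illustrative only: the helper names `pct_change` and `summarize` and the dictionary layout are assumptions, not part of the benchmark suite. Because the medians above are rounded, the recomputed values can differ from the published table by a point or two, most visibly for cost and tool calls.

```python
# Rounded absolute medians quoted in the text: (baseline, CLI, MCP).
MEDIANS = {
    "tokens_k": (395, 455, 104),
    "cost_usd": (0.17, 0.18, 0.07),
    "tool_calls": (28, 27, 5),
    "wall_s": (97, 111, 39),
}


def pct_change(baseline: float, value: float) -> float:
    """Relative change vs. the baseline, in percent (negative = reduction)."""
    return (value - baseline) / baseline * 100.0


def summarize(medians: dict) -> dict:
    """Map each metric to its (CLI, MCP) percentage change vs. the baseline."""
    return {
        metric: (pct_change(base, cli), pct_change(base, mcp))
        for metric, (base, cli, mcp) in medians.items()
    }


if __name__ == "__main__":
    for metric, (cli, mcp) in summarize(MEDIANS).items():
        print(f"{metric:>10}: CLI {cli:+.0f}%  MCP {mcp:+.0f}%")
```

Dividing by each arm's own baseline is what makes the oxcode and codegraph rows comparable despite the different harnesses.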

The MCP server is the headline

Delivering the same bounded, PageRank-curated context through a one-call oxcode_explore MCP tool — instead of a CLI the agent composes — cuts tool calls 84%, tokens 74%, cost 57%, and wall 60% vs. the no-tool baseline, exceeding codegraph's published reductions (−57% tool calls / −38% tokens).
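
To make "one call" concrete, here is a minimal sketch of the request an MCP client might send to invoke the tool. The JSON-RPC envelope and the `tools/call` method follow the Model Context Protocol; the `query` argument is a hypothetical placeholder, since the actual input schema of oxcode_explore is not described here, and `build_tool_call` is an illustrative helper rather than part of any SDK.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# ASSUMPTION: "query" is a hypothetical argument name; the real
# oxcode_explore schema may use different fields.
request = build_tool_call(
    1,
    "oxcode_explore",
    {"query": "How does tokio schedule and run async tasks?"},
)

if __name__ == "__main__":
    print(json.dumps(request, indent=2))
```

A single request of this shape stands in for the sequence of shell invocations an agent would otherwise compose in the CLI arm.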

The CLI arm is statistically tied with the baseline: the agent treats a shell binary as a supplement to its own grep/read, not a replacement — so the gap was always tool delivery, not index quality. The one cost the quality gate exposes (and a quality-blind benchmark would hide): MCP answer quality dips to 0.93 vs. 0.98, a completeness trade-off from the leaner exploration. codegraph numbers are from its README, re-validated 2026-06-02.

The engine-level numbers behind these results — incremental reindex, query latency, cold index — live in oxgraph Benchmarks.