Less context,
better answers.

Token Savior replaces raw file reads with structural queries. Measured across 60 real coding tasks against baseline Claude Code tools.

Claude Sonnet 4 · April 2026 · tsbench v1.0 · token-savior v2.5.1
−84%
Chars injected
56% → 96%
Accuracy
32 / 0
Won / Lost
21
Impossible w/o TS
Best improvement

TASK-043: Heavy Read

Read all functions from 6 large files and summarize architecture patterns.
Baseline (Run A)
Chars
196,882
Time
159.2s
Calls
13 Reads
Score
1/2
Token Savior (Run B)
Chars
16,416
Time
42.9s
Calls
6 queries
Score
2/2
Key comparisons

Where Token Savior wins

Chars injected into context (cumulative across 60 tasks)
Baseline
1,431,624 (1.43M)
TS
234,805 (235K)
Score (accuracy, out of 120 possible points)
Baseline
67 / 120 (56%)
TS
115 / 120 (96%)
Total turns (LLM round-trips needed)
Baseline
733
TS
435
How it works

Three steps, no magic

Step 1

Index

Token Savior parses your codebase into a structural graph of functions, classes, and dependencies. One call to switch_project.

Step 2

Query

The agent queries specific symbols instead of reading entire files. get_function_source, get_dependents, get_call_chain.

Step 3

Result

84% fewer characters but better-structured information. The signal-to-noise ratio improves, and accuracy jumps from 56% to 96%.
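The index-then-query idea behind the three steps can be sketched in a few lines of Python. This is an illustrative sketch only: the real Token Savior index also covers classes and cross-file dependencies, and while the function names below mirror its tools (`get_function_source`, `get_dependents`), the implementation here is our own toy version built on the standard-library `ast` module.

```python
import ast

# A stand-in "codebase": in practice this would be read from files.
SOURCE = '''
def fetch_user(db, user_id):
    return db.get(user_id)

def render_profile(db, user_id):
    user = fetch_user(db, user_id)
    return f"<h1>{user}</h1>"
'''

# Step 1: index -- parse the module once into a symbol table.
tree = ast.parse(SOURCE)
index = {
    node.name: ast.get_source_segment(SOURCE, node)
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
}

# Step 2: query -- return one symbol's source instead of the whole file.
def get_function_source(name):
    return index[name]

# Step 2: query -- a crude get_dependents: which functions call `name`?
def get_dependents(name):
    deps = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            called = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            if name in called:
                deps.append(node.name)
    return deps

print(get_function_source("fetch_user"))
print(get_dependents("fetch_user"))  # ['render_profile']
```

Step 3 falls out of the design: the agent's context receives only the two queried snippets, not the whole module, which is where the character savings come from.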

Breakdown

Results by category

Per-category table (columns: N, Score A, Score B, Delta, Wall A, Wall B) omitted here; the notable per-category results are summarized under "Honest limits" below.
Transparency

Honest limits

active_tokens +29% cumulative. The 84% chars reduction is offset by the MCP schema cache_creation cost. TS still wins on active tokens in 28/59 tasks (heavy_read, audit, impact, navigation), but loses on micro-tasks where the fixed schema cost dominates.

call_chain category +88% wall time. Structural get_call_chain queries are slower per call than Grep heuristics. The score still climbs from 5/8 to 8/8, so the trade-off is correctness over latency.

Simple localization +24% wall time. Single-symbol lookups pay MCP round-trip latency: switch_project + find_symbol takes ~6s vs ~3s for one Grep. The score still improves (1.17 → 1.83) thanks to fewer false negatives.

0 net score regressions. TS wins 32 tasks, ties 28, loses 0. 21 tasks are flat-out impossible without TS (score 0/2 baseline → ≥1/2 with TS), spanning config, infra, audit, debug, and cross-language work.
Open benchmark

Want to contribute?

tsbench is open and reproducible. python generate.py --seed 42 gives the same 60-task project every time.
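The reproducibility rests on a seeded RNG: same seed, same task set. A minimal sketch of the idea, with hypothetical category names and task shapes (this is not tsbench's actual generator):

```python
import random

def generate_tasks(seed, n=60):
    # Seeding a private Random instance makes the task list fully
    # deterministic -- the mechanism that lets `generate.py --seed 42`
    # rebuild the same 60-task project every time.
    rng = random.Random(seed)
    categories = ["heavy_read", "audit", "impact", "navigation", "call_chain"]
    return [(i, rng.choice(categories)) for i in range(n)]

assert generate_tasks(42) == generate_tasks(42)  # same seed, same project
print(len(generate_tasks(42)))  # 60
```

Because the RNG state is local to the generator, runs on different machines and Python processes produce byte-identical task lists for the same seed.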

Run it on your agent

If you run the benchmark on another agent and want to submit results, open a PR or issue. We'll add your results to the leaderboard.

Try Token Savior

Structural code navigation for Claude Code.
Less context, better answers.