Less context,
better answers.

Token Savior replaces raw file reads with structural queries. Measured across 60 real coding tasks against baseline Claude Code tools.

Claude Sonnet 4 · April 2026 · tsbench v1.0 · token-savior v2.5.1
−84%
Chars injected
56% → 96%
Accuracy
32 / 0
Won / Lost
21
Impossible w/o TS
Best improvement

TASK-043: Heavy Read

Read all functions from 6 large files and summarize architecture patterns.
Baseline (Run A)
Chars
196,882
Time
159.2s
Calls
13 Reads
Score
1/2
Token Savior (Run B)
Chars
16,416
Time
42.9s
Calls
6 queries
Score
2/2
Key comparisons

Where Token Savior wins

Chars injected into context (cumulative across 60 tasks)
Baseline
1,431,624 (1.43M)
TS
234,805 (235K)
Score (accuracy, out of 120 possible points)
Baseline
67 / 120 (56%)
TS
115 / 120 (96%)
Total turns (LLM round-trips needed)
Baseline
733
TS
435
How it works

Three steps, no magic

Step 1

Index

Token Savior parses your codebase into a structural graph of functions, classes, and dependencies. One call to switch_project.

Step 2

Query

The agent queries specific symbols instead of reading entire files. get_function_source, get_dependents, get_call_chain.

Step 3

Result

84% fewer characters but better-structured information. The signal-to-noise ratio improves, and accuracy jumps from 56% to 96%.
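The index-then-query idea behind the three steps can be sketched in a few lines of Python. This is an illustrative sketch only: the real Token Savior index also covers classes and cross-file dependencies, and while the function names below mirror its tools (`get_function_source`, `get_dependents`), the implementation here is our own toy version built on the standard-library `ast` module.

```python
import ast

# A stand-in "codebase": in practice this would be read from files.
SOURCE = '''
def fetch_user(db, user_id):
    return db.get(user_id)

def render_profile(db, user_id):
    user = fetch_user(db, user_id)
    return f"<h1>{user}</h1>"
'''

# Step 1: index -- parse the module once into a symbol table.
tree = ast.parse(SOURCE)
index = {
    node.name: ast.get_source_segment(SOURCE, node)
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
}

# Step 2: query -- return one symbol's source instead of the whole file.
def get_function_source(name):
    return index[name]

# Step 2: query -- a crude get_dependents: which functions call `name`?
def get_dependents(name):
    deps = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            called = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            if name in called:
                deps.append(node.name)
    return deps

print(get_function_source("fetch_user"))
print(get_dependents("fetch_user"))  # ['render_profile']
```

Step 3 falls out of the design: the agent's context receives only the two queried snippets, not the whole module, which is where the character savings come from.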

Breakdown

Results by category

Per-category table (columns: N, Score A, Score B, Delta, Wall A, Wall B) omitted here; the notable per-category results are summarized under "Honest limits" below.
Transparency

Honest limits

active_tokens +29% cumulative. The 84% chars reduction is offset by the MCP schema cache_creation cost. TS still wins on active tokens in 28/59 tasks (heavy_read, audit, impact, navigation), but loses on micro-tasks where the fixed schema cost dominates.

call_chain category +88% wall time. Structural get_call_chain queries are slower per call than Grep heuristics. The score still climbs from 5/8 to 8/8, so the trade-off is correctness over latency.

Simple localization +24% wall time. Single-symbol lookups pay MCP round-trip latency: switch_project + find_symbol takes ~6s vs ~3s for one Grep. The score still improves (1.17 → 1.83) thanks to fewer false negatives.

0 net score regressions. TS wins 32 tasks, ties 28, loses 0. 21 tasks are flat-out impossible without TS (score 0/2 baseline → ≥1/2 with TS), spanning config, infra, audit, debug, and cross-language work.
Open benchmark

Want to contribute?

tsbench is open and reproducible. python generate.py --seed 42 gives the same 60-task project every time.
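The reproducibility rests on a seeded RNG: same seed, same task set. A minimal sketch of the idea, with hypothetical category names and task shapes (this is not tsbench's actual generator):

```python
import random

def generate_tasks(seed, n=60):
    # Seeding a private Random instance makes the task list fully
    # deterministic -- the mechanism that lets `generate.py --seed 42`
    # rebuild the same 60-task project every time.
    rng = random.Random(seed)
    categories = ["heavy_read", "audit", "impact", "navigation", "call_chain"]
    return [(i, rng.choice(categories)) for i in range(n)]

assert generate_tasks(42) == generate_tasks(42)  # same seed, same project
print(len(generate_tasks(42)))  # 60
```

Because the RNG state is local to the generator, runs on different machines and Python processes produce byte-identical task lists for the same seed.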

Run it on your agent

If you run the benchmark on another agent and want to submit results, open a PR or issue. We'll add your results to the leaderboard.

Try Token Savior

Structural code navigation for Claude Code.
Less context, better answers.