330 queries. 19 repos. 16 languages. Every number LLM-judged and reproducible.
330 queries across 19 repositories — compared against LLM-crafted grep patterns, not naive keyword search
| Metric | codeindex | ripgrep | Multiplier |
|---|---|---|---|
| Precision@5 | 0.509 | 0.361 | 1.4x |
| HitRate@5 | 0.559 | 0.387 | 1.4x |
| MRR | 0.718 | 0.499 | 1.4x |
| nDCG@10 | 0.561 | 0.400 | 1.4x |
| Recall | 0.617 | 0.815 | 0.76x |
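
To check numbers like these on your own runs, here is a minimal sketch of the textbook binary-relevance forms of Precision@k, MRR, nDCG@k, and recall. This page does not state the exact formulas the eval uses (HitRate@5 in particular is left out here for that reason), so the function names and definitions below are assumptions, not the eval's actual code.

```python
import math


def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k returned files that are judged relevant."""
    return sum(1 for f in ranked[:k] if f in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant file, or 0.0 if none is returned.
    MRR is the mean of this value over all queries."""
    for position, f in enumerate(ranked, start=1):
        if f in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain nDCG: discounted gain of the ranking divided by the
    discounted gain of an ideal ranking with all relevant files first."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, f in enumerate(ranked[:k], start=1)
        if f in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def recall(ranked, relevant):
    """Fraction of all relevant files that appear anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


# Hypothetical single query: five returned files, three judged relevant.
ranked = ["a.py", "b.py", "c.py", "d.py", "e.py"]
relevant = {"b.py", "e.py", "z.py"}

print(precision_at_k(ranked, relevant))   # 2 of the top 5 are relevant
print(reciprocal_rank(ranked, relevant))  # first relevant file is at rank 2
print(ndcg_at_k(ranked, relevant))
print(recall(ranked, relevant))           # 2 of 3 relevant files were returned
```

Averaging each function over all queries gives one row of a table like the one above. The Multiplier column is the codeindex value divided by the ripgrep value.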
codeindex dominates large codebases. Ripgrep wins on small repos with distinctive naming. We show both.
1.2M LOC — XAML/C# naming creates broad pattern matches
4.1M LOC — patterns like class.*Index match everywhere
Reactive patterns (subscribe, Observable) appear in every file
Python monorepo with domain-specific but non-unique class names
3.9M LOC — common patterns (func.*Handler) match thousands of files
266k LOC Lua API gateway — codeindex finds plugin handlers by intent
764k LOC Rails app — MVC naming conventions create broad matches
12k LOC — small enough that grep patterns hit the right files
Distinctive function names (sqlite3_prepare) are literally greppable
37k LOC — small crate with function names that directly match patterns
16 languages, each tested on a representative public repo. 15 queries per language, LLM-judged.
Ablation study on self-eval dataset. Multi-repo validation in progress.
Most effective in repos with strict atomic commits and descriptive messages: 25.8% MRR lift on the self-eval dataset. The effect varies widely across the broader 19-repo corpus; we are investigating how it depends on commit hygiene.
Propagates directory-level relevance scores to child files, so files in semantically relevant directories rank higher.
No measurable effect — honestly disclosed
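
The directory-level propagation described above can be sketched roughly as follows. This is an illustration only: the additive blend, the `weight` parameter, and the function names are assumptions, and the page does not say how codeindex actually combines the scores.

```python
from pathlib import PurePosixPath


def propagate_directory_scores(file_scores, dir_scores, weight=0.3):
    """Add a weighted share of each file's parent-directory score to the
    file's own score. `weight` is a hypothetical tuning knob."""
    boosted = {}
    for path, score in file_scores.items():
        parent = str(PurePosixPath(path).parent)
        # Files whose directory has no score get no boost.
        boosted[path] = score + weight * dir_scores.get(parent, 0.0)
    return boosted


def rank_files(file_scores, dir_scores, weight=0.3):
    """Return file paths sorted by boosted score, highest first."""
    boosted = propagate_directory_scores(file_scores, dir_scores, weight)
    return sorted(boosted, key=boosted.get, reverse=True)


# Hypothetical scores for one query about login handling.
file_scores = {"src/auth/login.py": 0.50, "docs/readme.md": 0.55}
dir_scores = {"src/auth": 0.9, "docs": 0.1}

print(rank_files(file_scores, dir_scores))
```

In this toy case the file in the more relevant directory moves ahead of a file with a slightly higher raw score. Setting `weight=0.0` turns the propagation off, which is how an ablation run would compare the two rankings.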
Run the eval on your codebase
We publish what doesn't work alongside what does.
On repos under ~100k LOC with descriptive identifiers (ripgrep, express, sqlite), grep patterns find the right files directly. codeindex's ranking advantage only matters when the search space is large enough that grep patterns return noisy matches.
ghostty-org/ghostty was not indexed during this eval because of a batch token limit. Its 7% MRR reflects an infrastructure gap, not search quality. The indexing issue is now fixed.
Ripgrep returns more total files (0.82 recall vs 0.62) because it has no result cap. codeindex returns top-32 results ranked by relevance. If you need exhaustive file lists, grep wins.
The baseline grep patterns were generated by Claude Sonnet, which knows these public repos. A real developer unfamiliar with a codebase wouldn't know to search for sqlite3_prepare or MiddlewareStack. On private or unfamiliar repos, we expect codeindex's advantage to be larger, though this eval does not measure that.
The first query requires loading embeddings; subsequent queries are fast. Averaged across all repos, codeindex took 1.3s per query versus 1.6s for ripgrep, and it is faster than ripgrep on large repos (>1M LOC).
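
The recall trade-off noted above follows from simple arithmetic: a capped result list cannot return more relevant files than the cap allows. A minimal sketch, assuming the top-32 cap stated above (the function name is hypothetical):

```python
def recall_ceiling(num_relevant, result_cap=32):
    """Best recall a capped result list can reach for a query with
    `num_relevant` relevant files, even with a perfect ranking."""
    if num_relevant <= 0:
        return 0.0
    return min(num_relevant, result_cap) / num_relevant


# A query with 20 relevant files can still reach full recall,
# but one with 64 relevant files tops out at half.
print(recall_ceiling(20))
print(recall_ceiling(64))
```

An uncapped tool such as ripgrep has no such ceiling, which is why it wins when the goal is an exhaustive file list.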