Proven Results. Reproducible Methodology.

330 queries. 19 repos. 16 languages. Every number LLM-judged and reproducible.

• 1.4x ranking quality: MRR, P@5, nDCG, all metrics vs expert grep
• 0.718 MRR: first relevant file typically at rank 1–2
• 330 queries: LLM-generated, LLM-judged
• 16 languages: each tested on a real public repo
• $0.70 total indexing cost: 42 repos indexed

42 repos. Over 100,000 files. $0.70.

Model: OpenAI text-embedding-3-small
Per repo: ~$0.017
Per file: ~$0.000007
Local (Ollama): $0.00
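A quick sanity check on the cost arithmetic, using the headline figures above (file count rounded to the "over 100,000 files" claim):

```python
total_cost = 0.70   # dollars, OpenAI text-embedding-3-small
repos = 42
files = 100_000     # "over 100,000 files", rounded for the estimate

per_repo = total_cost / repos   # ~$0.017
per_file = total_cost / files   # ~$0.000007

print(f"per repo: ${per_repo:.3f}, per file: ${per_file:.6f}")
```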

Multi-Repo Evaluation Results

330 queries across 19 repositories — compared against LLM-crafted grep patterns, not naive keyword search

Metric        codeindex   ripgrep   Multiplier
Precision@5   0.509       0.361     1.4x
HitRate@5     0.559       0.387     1.4x
MRR           0.718       0.499     1.4x
nDCG@10       0.561       0.400     1.4x
Recall        0.617       0.815     0.76x
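For reference, the headline metrics can be computed from per-query relevance judgments like this (a minimal sketch assuming binary relevance, as in file-level LLM judging; not the eval harness itself):

```python
import math

def mrr(rankings):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant result.
    total = 0.0
    for relevant in rankings:  # each: list of 0/1 flags in ranked order
        for i, rel in enumerate(relevant, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

def precision_at_k(relevant, k=5):
    # Fraction of the top-k results that are relevant.
    return sum(relevant[:k]) / k

def ndcg_at_k(relevant, k=10):
    # Normalized discounted cumulative gain: ranking quality vs the ideal order.
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevant[:k], start=1))
    ideal = sorted(relevant, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0
```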

Per-Repository Breakdown

codeindex dominates large codebases. Ripgrep wins on small repos with distinctive naming. We show both.

unoplatform/uno (C#)
1.2M LOC — XAML/C# naming creates broad pattern matches
11.0x MRR · 15 queries · HitRate@5 77% vs 4% · MRR 0.967 vs 0.088

elastic/elasticsearch (Java)
4.1M LOC — patterns like class.*Index match everywhere
5.0x MRR · 15 queries · HitRate@5 87% vs 11% · MRR 0.933 vs 0.187

ReactiveX/RxSwift (Swift)
Reactive patterns (subscribe, Observable) appear in every file
2.8x MRR · 15 queries · HitRate@5 67% vs 20% · MRR 0.850 vs 0.303

langchain-ai/langchain (Python)
Python monorepo with domain-specific but non-unique class names
2.7x MRR · 15 queries · HitRate@5 71% vs 22% · MRR 0.791 vs 0.294

kubernetes/kubernetes (Go)
3.9M LOC — common patterns (func.*Handler) match thousands of files
2.4x MRR · 15 queries · HitRate@5 79% vs 27% · MRR 1.000 vs 0.418

Kong/kong (Lua)
266k LOC Lua API gateway — codeindex finds plugin handlers by intent
2.4x MRR · 15 queries · HitRate@5 51% vs 33% · MRR 0.819 vs 0.339

discourse/discourse (Ruby)
764k LOC Rails app — MVC naming conventions create broad matches
1.6x MRR · 15 queries · HitRate@5 58% vs 24% · MRR 0.656 vs 0.402

expressjs/express (JavaScript)
12k LOC — small enough that grep patterns hit the right files
0.94x MRR · 15 queries · HitRate@5 44% vs 66% · MRR 0.776 vs 0.822

sqlite/sqlite (C)
Distinctive function names (sqlite3_prepare) are literally greppable
0.83x MRR · 15 queries · HitRate@5 50% vs 56% · MRR 0.694 vs 0.839

BurntSushi/ripgrep (Rust)
37k LOC — small crate with function names that directly match patterns
0.65x MRR · 15 queries · HitRate@5 26% vs 67% · MRR 0.497 vs 0.761

Per-Language Performance

16 languages, each tested on a representative public repo. 15 queries per language, LLM-judged.

Language     Rating      MRR    HitRate@5   Repo
Go           excellent   100%   79%         kubernetes/kubernetes
C#           excellent    97%   77%         unoplatform/uno
PHP          excellent    97%   79%         laravel/framework
Rust         excellent    96%   71%         tangled.org/core
Java         excellent    93%   87%         elastic/elasticsearch
Swift        excellent    85%   67%         ReactiveX/RxSwift
Lua          excellent    82%   51%         Kong/kong
Python       good         79%   71%         langchain-ai/langchain
JavaScript   good         78%   44%         expressjs/express
Kotlin       good         71%   65%         square/okhttp
C            good         69%   50%         sqlite/sqlite
Ruby         good         66%   58%         discourse/discourse
TypeScript   good         53%   48%         shadcn-ui/ui
C++          limited      49%   32%         nlohmann/json
Elixir       limited      42%   29%         phoenixframework/phoenix
Zig          broken        7%    3%         ghostty-org/ghostty (unindexed)

What Makes It Work — Signal Contribution

Ablation study on self-eval dataset. Multi-repo validation in progress.

Commit message similarity: +25.8%

Powerful in repos with strict atomic commits and descriptive messages — 25.8% MRR lift on self-eval. Effect varies widely across the broader 19-repo corpus; investigating contribution by commit hygiene level.
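One plausible shape for this signal (illustrative only; codeindex's actual scoring and embedding model are not shown here): embed the query and each file's commit messages, then score the file by its best-matching commit.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def commit_signal(query_vec, commit_vecs):
    # A file's commit signal: its best-matching commit-message embedding.
    return max((cosine(query_vec, v) for v in commit_vecs), default=0.0)
```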

Parent directory boost: +5.9%

Propagates directory-level relevance scores to child files, so files in semantically relevant directories rank higher.
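A minimal sketch of that propagation, assuming a mean-based directory score and an additive nudge (the function name, the averaging, and the weight are illustrative assumptions, not codeindex's implementation):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def apply_parent_boost(file_scores, boost=0.059):
    # Average each directory's file scores, then nudge every file
    # toward its parent directory's mean relevance.
    dir_scores = defaultdict(list)
    for path, score in file_scores.items():
        dir_scores[str(PurePosixPath(path).parent)].append(score)
    dir_mean = {d: sum(s) / len(s) for d, s in dir_scores.items()}
    return {
        path: score + boost * dir_mean[str(PurePosixPath(path).parent)]
        for path, score in file_scores.items()
    }
```

A weakly scored file in a strongly scored directory rises; an isolated strong file is unaffected by neighbors it doesn't have.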

Child-to-parent: 0%

No measurable effect — honestly disclosed

Methodology

  • 330 queries across 19 public repos (17 on GitHub, 2 on tangled.org), 15 queries per repo
  • 3-step pipeline: LLM topic generation → candidate collection (codeindex + ripgrep) → LLM file-level judging
  • Two ripgrep baselines: rg-naive (literal strings) and rg-regex (literal + regex) — both using LLM-crafted patterns that favor ripgrep
  • Standard IR metrics: P@5, HitRate@5, MRR, nDCG@10, Recall, Precision@All
  • Reproducible: eval harness and dataset generation included in the repository
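The pipeline's shape, with the LLM and search steps stubbed out (every function body here is a placeholder, not codeindex's real interface):

```python
def generate_topics(repo, n=15):
    # Step 1 (stub): an LLM proposes search topics for the repo.
    return [f"topic {i} for {repo}" for i in range(n)]

def collect_candidates(repo, topic):
    # Step 2 (stub): pool codeindex's top results with ripgrep matches.
    codeindex_hits = ["src/auth.py"]              # placeholder results
    ripgrep_hits = ["src/auth.py", "src/db.py"]   # placeholder results
    return sorted(set(codeindex_hits) | set(ripgrep_hits))

def judge(topic, path):
    # Step 3 (stub): an LLM labels each candidate file's relevance.
    return path.endswith("auth.py")

def run_eval(repo):
    judgments = []
    for topic in generate_topics(repo):
        labels = {p: judge(topic, p) for p in collect_candidates(repo, topic)}
        judgments.append(labels)
    return judgments  # feed into MRR / P@5 / nDCG scoring
```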

Run the eval on your codebase

Known Limitations

We publish what doesn't work alongside what does.

Ripgrep wins on small, well-named repos

On repos under ~100k LOC with descriptive identifiers (ripgrep, express, sqlite), grep patterns find files directly. codeindex's ranking advantage only matters when the search space is large enough for noise.

Zig support

ghostty-org/ghostty was not indexed during this eval (batch token limit). 7% MRR reflects an infrastructure gap, not search quality. Now fixed.

Recall vs ranking tradeoff

Ripgrep returns more total files (0.82 recall vs 0.62) because it has no result cap. codeindex returns top-32 results ranked by relevance. If you need exhaustive file lists, grep wins.
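The tradeoff is mechanical: a fixed result cap bounds recall whenever more relevant files exist than the cap allows. A sketch with illustrative numbers (not the eval's):

```python
def recall(returned, relevant):
    # Fraction of all relevant files present in the returned list.
    return len(set(returned) & set(relevant)) / len(relevant)

relevant = [f"file_{i}.py" for i in range(50)]   # 50 relevant files
ranked = relevant + ["noise.py"] * 10            # perfect ranking, uncapped
capped = ranked[:32]                             # a top-32 cap, as codeindex uses

assert recall(ranked, relevant) == 1.0
assert recall(capped, relevant) == 32 / 50       # the cap bounds recall at 0.64
```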

LLM-generated grep patterns favor ripgrep

The baseline grep patterns were generated by Claude Sonnet, which knows these public repos. A real developer unfamiliar with a codebase wouldn't know to search for sqlite3_prepare or MiddlewareStack. On private or unfamiliar repos, codeindex's advantage would be larger.

Cold start latency

First query requires loading embeddings. Subsequent queries are fast. On large repos (>1M LOC), codeindex is actually faster than ripgrep — 1.3s vs 1.6s average across all repos.