Proven Results. Reproducible Methodology.

330 queries. 19 repos. 16 languages. Every number LLM-judged and reproducible.

• 1.4x ranking quality: MRR, P@5, nDCG, all metrics vs expert grep
• 0.718 MRR: first relevant file typically at rank 1–2
• 330 queries: LLM-generated, LLM-judged
• 16 languages: each tested on a real public repo
• $0.70 total indexing cost: 42 repos indexed

42 repos. Over 100,000 files. $0.70.

Model: OpenAI text-embedding-3-small
Per repo: ~$0.017
Per file: ~$0.000007
Local (Ollama): $0.00
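A quick sanity check on the cost arithmetic, using the headline figures above (file count rounded to the "over 100,000 files" claim):

```python
total_cost = 0.70   # dollars, OpenAI text-embedding-3-small
repos = 42
files = 100_000     # "over 100,000 files", rounded for the estimate

per_repo = total_cost / repos   # ~$0.017
per_file = total_cost / files   # ~$0.000007

print(f"per repo: ${per_repo:.3f}, per file: ${per_file:.6f}")
```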

Multi-Repo Evaluation Results

330 queries across 19 repositories — compared against LLM-crafted grep patterns, not naive keyword search

Metric        codeindex   ripgrep   Multiplier
Precision@5   0.509       0.361     1.4x
HitRate@5     0.559       0.387     1.4x
MRR           0.718       0.499     1.4x
nDCG@10       0.561       0.400     1.4x
Recall        0.617       0.815     0.76x
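For reference, the headline metrics can be computed from per-query relevance judgments like this (a minimal sketch assuming binary relevance, as in file-level LLM judging; not the eval harness itself):

```python
import math

def mrr(rankings):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant result.
    total = 0.0
    for relevant in rankings:  # each: list of 0/1 flags in ranked order
        for i, rel in enumerate(relevant, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

def precision_at_k(relevant, k=5):
    # Fraction of the top-k results that are relevant.
    return sum(relevant[:k]) / k

def ndcg_at_k(relevant, k=10):
    # Normalized discounted cumulative gain: ranking quality vs the ideal order.
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevant[:k], start=1))
    ideal = sorted(relevant, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0
```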

Per-Repository Breakdown

codeindex dominates large codebases. Ripgrep wins on small repos with distinctive naming. We show both.

unoplatform/uno (C#)
1.2M LOC — XAML/C# naming creates broad pattern matches
11.0x MRR · 15 queries · HitRate@5 77% vs 4% · MRR 0.967 vs 0.088

elastic/elasticsearch (Java)
4.1M LOC — patterns like class.*Index match everywhere
5.0x MRR · 15 queries · HitRate@5 87% vs 11% · MRR 0.933 vs 0.187

ReactiveX/RxSwift (Swift)
Reactive patterns (subscribe, Observable) appear in every file
2.8x MRR · 15 queries · HitRate@5 67% vs 20% · MRR 0.850 vs 0.303

langchain-ai/langchain (Python)
Python monorepo with domain-specific but non-unique class names
2.7x MRR · 15 queries · HitRate@5 71% vs 22% · MRR 0.791 vs 0.294

kubernetes/kubernetes (Go)
3.9M LOC — common patterns (func.*Handler) match thousands of files
2.4x MRR · 15 queries · HitRate@5 79% vs 27% · MRR 1.000 vs 0.418

Kong/kong (Lua)
266k LOC Lua API gateway — codeindex finds plugin handlers by intent
2.4x MRR · 15 queries · HitRate@5 51% vs 33% · MRR 0.819 vs 0.339

discourse/discourse (Ruby)
764k LOC Rails app — MVC naming conventions create broad matches
1.6x MRR · 15 queries · HitRate@5 58% vs 24% · MRR 0.656 vs 0.402

expressjs/express (JavaScript)
12k LOC — small enough that grep patterns hit the right files
0.94x MRR · 15 queries · HitRate@5 44% vs 66% · MRR 0.776 vs 0.822

sqlite/sqlite (C)
Distinctive function names (sqlite3_prepare) are literally greppable
0.83x MRR · 15 queries · HitRate@5 50% vs 56% · MRR 0.694 vs 0.839

BurntSushi/ripgrep (Rust)
37k LOC — small crate with function names that directly match patterns
0.65x MRR · 15 queries · HitRate@5 26% vs 67% · MRR 0.497 vs 0.761

Per-Language Performance

16 languages, each tested on a representative public repo. 15 queries per language, LLM-judged.

Language     Rating      MRR    HitRate@5   Repo
Go           excellent   100%   79%         kubernetes/kubernetes
C#           excellent    97%   77%         unoplatform/uno
PHP          excellent    97%   79%         laravel/framework
Rust         excellent    96%   71%         tangled.org/core
Java         excellent    93%   87%         elastic/elasticsearch
Swift        excellent    85%   67%         ReactiveX/RxSwift
Lua          excellent    82%   51%         Kong/kong
Python       good         79%   71%         langchain-ai/langchain
JavaScript   good         78%   44%         expressjs/express
Kotlin       good         71%   65%         square/okhttp
C            good         69%   50%         sqlite/sqlite
Ruby         good         66%   58%         discourse/discourse
TypeScript   good         53%   48%         shadcn-ui/ui
C++          limited      49%   32%         nlohmann/json
Elixir       limited      42%   29%         phoenixframework/phoenix
Zig          broken        7%    3%         ghostty-org/ghostty (unindexed)

What Makes It Work — Signal Contribution

Ablation study on self-eval dataset. Multi-repo validation in progress.

Commit message similarity: +25.8%

Powerful in repos with strict atomic commits and descriptive messages — 25.8% MRR lift on self-eval. Effect varies widely across the broader 19-repo corpus; investigating contribution by commit hygiene level.
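One plausible shape for this signal (illustrative only; codeindex's actual scoring and embedding model are not shown here): embed the query and each file's commit messages, then score the file by its best-matching commit.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def commit_signal(query_vec, commit_vecs):
    # A file's commit signal: its best-matching commit-message embedding.
    return max((cosine(query_vec, v) for v in commit_vecs), default=0.0)
```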

Parent directory boost: +5.9%

Propagates directory-level relevance scores to child files, so files in semantically relevant directories rank higher.
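A minimal sketch of that propagation, assuming a mean-based directory score and an additive nudge (the function name, the averaging, and the weight are illustrative assumptions, not codeindex's implementation):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def apply_parent_boost(file_scores, boost=0.059):
    # Average each directory's file scores, then nudge every file
    # toward its parent directory's mean relevance.
    dir_scores = defaultdict(list)
    for path, score in file_scores.items():
        dir_scores[str(PurePosixPath(path).parent)].append(score)
    dir_mean = {d: sum(s) / len(s) for d, s in dir_scores.items()}
    return {
        path: score + boost * dir_mean[str(PurePosixPath(path).parent)]
        for path, score in file_scores.items()
    }
```

A weakly scored file in a strongly scored directory rises; an isolated strong file is unaffected by neighbors it doesn't have.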

Child-to-parent: 0%

No measurable effect — honestly disclosed

Methodology

  • 330 queries across 19 public repos (17 on GitHub, 2 on tangled.org), 15 queries per repo
  • 3-step pipeline: LLM topic generation → candidate collection (codeindex + ripgrep) → LLM file-level judging
  • Two ripgrep baselines: rg-naive (literal strings) and rg-regex (literal + regex) — both using LLM-crafted patterns that favor ripgrep
  • Standard IR metrics: P@5, HitRate@5, MRR, nDCG@10, Recall, Precision@All
  • Reproducible: eval harness and dataset generation included in the repository
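The pipeline's shape, with the LLM and search steps stubbed out (every function body here is a placeholder, not codeindex's real interface):

```python
def generate_topics(repo, n=15):
    # Step 1 (stub): an LLM proposes search topics for the repo.
    return [f"topic {i} for {repo}" for i in range(n)]

def collect_candidates(repo, topic):
    # Step 2 (stub): pool codeindex's top results with ripgrep matches.
    codeindex_hits = ["src/auth.py"]              # placeholder results
    ripgrep_hits = ["src/auth.py", "src/db.py"]   # placeholder results
    return sorted(set(codeindex_hits) | set(ripgrep_hits))

def judge(topic, path):
    # Step 3 (stub): an LLM labels each candidate file's relevance.
    return path.endswith("auth.py")

def run_eval(repo):
    judgments = []
    for topic in generate_topics(repo):
        labels = {p: judge(topic, p) for p in collect_candidates(repo, topic)}
        judgments.append(labels)
    return judgments  # feed into MRR / P@5 / nDCG scoring
```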

Run the eval on your codebase

Known Limitations

We publish what doesn't work alongside what does.

Ripgrep wins on small, well-named repos

On repos under ~100k LOC with descriptive identifiers (ripgrep, express, sqlite), grep patterns find files directly. codeindex's ranking advantage only matters when the search space is large enough for noise.

Zig support

ghostty-org/ghostty was not indexed during this eval (batch token limit). 7% MRR reflects an infrastructure gap, not search quality. Now fixed.

Recall vs ranking tradeoff

Ripgrep returns more total files (0.82 recall vs 0.62) because it has no result cap. codeindex returns top-32 results ranked by relevance. If you need exhaustive file lists, grep wins.
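The tradeoff is mechanical: a fixed result cap bounds recall whenever more relevant files exist than the cap allows. A sketch with illustrative numbers (not the eval's):

```python
def recall(returned, relevant):
    # Fraction of all relevant files present in the returned list.
    return len(set(returned) & set(relevant)) / len(relevant)

relevant = [f"file_{i}.py" for i in range(50)]   # 50 relevant files
ranked = relevant + ["noise.py"] * 10            # perfect ranking, uncapped
capped = ranked[:32]                             # a top-32 cap, as codeindex uses

assert recall(ranked, relevant) == 1.0
assert recall(capped, relevant) == 32 / 50       # the cap bounds recall at 0.64
```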

LLM-generated grep patterns favor ripgrep

The baseline grep patterns were generated by Claude Sonnet, which knows these public repos. A real developer unfamiliar with a codebase wouldn't know to search for sqlite3_prepare or MiddlewareStack. On private or unfamiliar repos, codeindex's advantage would be larger.

Cold start latency

First query requires loading embeddings. Subsequent queries are fast. On large repos (>1M LOC), codeindex is actually faster than ripgrep — 1.3s vs 1.6s average across all repos.