
The reranker's quality ceiling is defined by the candidate set it receives from retrieval. This becomes most visible in external rerankers, but it applies to any multi-stage ranking system where retrieval and ranking are separated. Once this ceiling is reached, further gains rarely come from model tuning alone. They require changes to retrieval, candidate generation, or the structure of the pipeline itself.
Retrieval defines eligibility, reranking defines order
Before reranking, the system has already decided what is eligible to rank: query interpretation, retrieval strategy, lexical and semantic fusion, filtering, permissions, and latency constraints all shape the candidate set.
If a document is not retrieved, it never participates in ranking, and no downstream stage can reintroduce it.
This creates a hard constraint: reranking can only reorder what retrieval surfaces. It cannot recover missing documents. As a result, retrieval quality and reranking quality are different properties:
- retrieval controls what enters the system (recall)
- reranking controls how that subset is ordered
A strong reranker can produce good top-k results even when retrieval is incomplete. But the system may still miss relevant documents entirely, even if ranking quality looks strong.
The external reranker tradeoff
External reranking typically follows a simple pattern: retrieve candidates, compute features, and reorder results with a stronger model. This works well when relevance can be inferred from document-level signals like text, metadata, or embeddings.
The limitation appears when ranking depends on query–document interaction structure built during retrieval. The retrieval stage can compute rich signals: term matches and coverage, field-level contributions, proximity and alignment, query-specific scoring components that describe how the document matched the query, not just what it contains.
External rerankers typically do not receive this full interaction structure. They operate on a reduced representation: document text, metadata, embeddings, and a limited set of precomputed features. This produces a structural tradeoff:
more candidates → less per-candidate retrieval context
more context → fewer candidates
Reranking becomes compensatory under constraint
Two-stage retrieval is a standard production pattern because full-corpus scoring is too expensive at query time. The shift happens when reranking stops refining a good candidate set and starts compensating for weak retrieval. In this regime, performance gains often come from increasing the rerank window rather than improving retrieval itself. The transition is observable: a system where ranking quality improves by widening the rerank window, rather than by improving retrieval, is a system where reranking has become compensatory. The window size is no longer a latency knob; it is load-bearing. Retrieval is no longer the quality driver; it is the constraint.
A query like red waterproof hiking shoes size 11 illustrates the ceiling. Lexical retrieval may preserve exact constraints but miss semantically relevant products. Semantic retrieval may capture relevant footwear but lose attribute precision. The reranker cannot recover what was never retrieved. In production, this often shows up as stable top-3 relevance scores while long-tail recall silently degrades across query variants.
Ranking is a projection, not a recomputation
Retrieval does not just filter documents: it computes a rich query-time representation: term matches, field contributions, proximity scores, BM25 components, semantic embeddings. That computation happens once, against the full index, under the constraints of query time.
Reranking does not redo that computation. It operates on whatever survives into the candidate set: document text, metadata, a reduced feature vector, maybe a few exported signals. The retrieval-time computation is not rerun: it is partially preserved, partially discarded, and the reranker works with what remains.
Ranking in a two-stage system is therefore not a single scoring function applied to documents. It is a projection: retrieval-time signals compressed into a candidate representation, then re-scored under a different model with different inputs. That projection is lossy by construction: not a failure of implementation, but the structural consequence of separating retrieval from ranking. Every gain the reranker makes operates within the bounds of what that projection preserved.
Evaluation splits across stages
Once the system is multi-stage, evaluation splits along the same boundary. Retrieval is measured in recall (e.g. Recall@K): whether relevant documents appear in the candidate set. Reranking is measured using ranking metrics like NDCG or MRR: how well the system orders a fixed candidate set. These are partially independent: NDCG can improve while recall is weak, and retrieval improvements may not immediately affect top-k ordering.
The failure mode this creates is asymmetric visibility: reranking metrics improve, user-visible relevance plateaus, and the gap between them is invisible to any single-stage evaluation. Neither team's dashboard shows the full picture.
Ownership of relevance splits
The technical decomposition is mirrored organizationally, and this is why the failure mode persists rather than getting fixed.
Retrieval is owned by engineering. Reranking is owned by data science or ML. Engineering optimizes for indexing correctness, query latency, recall, and system reliability. ML optimizes for model quality, feature engineering, and offline ranking metrics. They use different tooling, different evaluation pipelines, and different definitions of what "better" means. They rarely share a common view of the full query path.
The consequence is structural: relevance is no longer a single optimization target. A retrieval change shifts the input distribution the reranker was tuned against. A reranker improvement masks retrieval weaknesses, removing pressure to fix them. Each system looks correct under its own metrics, on its own eval set, with its own assumptions about what the other stage is providing.
Local correctness does not imply global correctness. That gap does not close by improving either system in isolation: it closes only when someone is accountable for the space between them.
The real boundary
Rerankers are a standard component in modern search systems. The problem is not their presence but what they are asked to do.
When reranking becomes the dominant driver of final ordering quality, the system has quietly transferred the relevance problem downstream. Retrieval becomes infrastructure; the reranker becomes the product. That transfer is rarely deliberate: it accumulates through incremental fixes, each locally reasonable, until the reranker is doing work it was never designed to own and retrieval quality has stopped mattering to anyone's roadmap.
The system still functions. Metrics still improve. But the thing that was supposed to be a refinement step is now load-bearing, operating on a lossy projection of retrieval-time signals, optimized by a team that does not own the stage it depends on most.
That is the boundary, and most teams cross it before they name it.