Reranking

Reranking is the final ranking pass. After retrieval produces a hybrid candidate list, a cross-encoder model re-scores those candidates by reading each (query, chunk) pair as a single input — rather than comparing two independent embeddings.

Cross-encoders are slower, since they need one forward pass per candidate instead of a single query embedding compared against precomputed chunk embeddings, but they are materially more accurate. A model such as bge-reranker or cohere-rerank typically lifts top-K precision by 10–25% over hybrid retrieval alone. For chat and Q&A workloads, the latency cost is almost always worth it.
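The second-pass scoring described above can be sketched roughly as follows. This is a minimal illustration, not CoreCube's implementation: `rerank`, `score_pair`, and `toy_overlap_score` are hypothetical names, and the toy scorer is only a stand-in for a real cross-encoder forward pass over each (query, chunk) pair.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair and return the top_k highest-scoring chunks."""
    # One scoring call per candidate: this is where the per-candidate cost comes from.
    scored = [(chunk, score_pair(query, chunk)) for chunk in candidates]
    # Sort by score, highest first. Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer (assumption): fraction of query terms that appear in the chunk.

    A real deployment would run a cross-encoder model on the pair instead.
    """
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    "how to reset a password",
    "billing cycle overview",
    "password reset for admins",
]
results = rerank("reset password", candidates, toy_overlap_score, top_k=2)
```

Passing the scorer in as a callable keeps the ranking logic independent of which model produces the scores.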

This page also covers two ranking-stage decisions that aren't reranker-specific: how many results to return and how aggressively to penalize stale documents.


Reranker enabled

Whether to run the reranker at all.

| Setting | When to use |
| --- | --- |
| On (recommended) | Chat, Q&A, search-style UX where ranking quality matters and an extra 50–500 ms per query is acceptable. |
| Off | Latency-critical lookup APIs, bulk export jobs, evaluation harnesses where you want to measure raw retrieval quality without the rerank layer. |

Disabling the reranker means the Final top-K is taken straight from the hybrid candidate list ranked by RRF score — no second-pass refinement.
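For readers unfamiliar with RRF (reciprocal rank fusion), a minimal sketch of how a fused list and its top-K could be produced is shown here. The function name `rrf_fuse` and the constant `k=60` are assumptions for illustration (60 is a commonly used value); this page does not state which constant CoreCube uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a document's score, with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["a", "b", "c"]   # hypothetical dense retrieval order
sparse_hits = ["b", "d", "a"]  # hypothetical sparse retrieval order

fused = rrf_fuse([dense_hits, sparse_hits])
final_top_k = 2
# With the reranker off, the result is simply the head of the fused list.
top_ids = [doc_id for doc_id, _ in fused[:final_top_k]]  # ["b", "a"]
```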

When off, the Reranker model and Rerank pool fields are inert (still stored but not read by the runtime).


Reranker model

The cross-encoder used to re-score candidates.

| Class | Examples | Notes |
| --- | --- | --- |
| Local | bge-reranker-v2-m3, qwen3-reranker-4b, Qwen/Qwen3-Reranker-0.6B, bge-reranker-large, mxbai-rerank-base-v1 | No outbound calls. ~50–200 ms per batch on a modern GPU; ~500 ms+ on CPU. Multilingual variants exist. |
| Cloud | cohere-rerank-3.5, voyage-rerank-2 | Best-in-class quality and latency (~50–100 ms typical). Subject to provider rate limits and per-call cost. |

The picker is disabled when Reranker enabled is off — there is nothing to rank with.

Switching the model has no ingestion-side cost. Reranking happens at query time only, so changing the model takes effect on the next query — no re-embed, no rebuild.


Rerank pool

The number of candidates fed to the reranker per query.

The hybrid stage produces up to dense_top_k + sparse_top_k distinct candidates (typically 30–60). This pool is the first N of that fused list that the reranker actually scores.

| Pool size | Effect |
| --- | --- |
| < Final top-K | Invalid. The reranker has fewer candidates than you want returned; the form blocks this. |
| Final top-K × 2 | Cheap reranking. Works when retrieval already does most of the ranking work and you mostly want a final tiebreaker. |
| 30–100 (typical) | The sweet spot. The reranker has room to promote chunks that ranked mid-list out of hybrid, which is its main value-add. |
| > 200 | Diminishing returns and rising latency. Each extra candidate is one more cross-encoder forward pass. |

Constraint: retrieval_rerank_candidates ≥ retrieval_final_top_k. Otherwise the reranker would score fewer candidates than the K results you asked it to return.
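The pool selection and its constraint can be sketched as below. `select_rerank_pool` is a hypothetical helper, not part of any documented API; it only illustrates taking the first N fused candidates and rejecting a pool smaller than Final top-K.

```python
from typing import List, Sequence


def select_rerank_pool(
    fused_candidates: Sequence[str],
    rerank_candidates: int,
    final_top_k: int,
) -> List[str]:
    """Return the head of the fused list that the reranker will score."""
    if rerank_candidates < final_top_k:
        # Mirrors the form validation: the pool must be at least as large as K.
        raise ValueError(
            f"rerank pool ({rerank_candidates}) must be >= final top-K ({final_top_k})"
        )
    # Slicing is safe even when fewer candidates exist than the pool size.
    return list(fused_candidates[:rerank_candidates])


fused = [f"chunk-{i}" for i in range(60)]  # stand-in for the hybrid output
pool = select_rerank_pool(fused, rerank_candidates=50, final_top_k=7)
```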


Final top-K

The number of chunks returned to the caller after all ranking is done.

This is the number of chunks the LLM sees: every chunk ranked within the top K is included in the prompt context. Each chunk costs prompt tokens, so K and prompt cost scale together.

| Workload | Typical K | Why |
| --- | --- | --- |
| Chat / Q&A | 5–10 | The LLM needs enough context to answer but not so much that it gets distracted. |
| Citation-heavy answers | 8–15 | More citations = more confidence, at a higher prompt-token cost. |
| Broad research / summarization | 20–50 | Maximize coverage of the topic; the LLM does the synthesis. |
| Headless /v1/search | depends on caller | The API just returns the chunks; the consuming app decides what to do with them. |

Token budget reality check: at K = 10 with 512-token chunks, you are passing ~5,000 tokens of context per request before the system prompt or user message. At K = 50 you are at ~25,000 tokens — only a problem if your model has a tight context window or you are cost-sensitive.
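The arithmetic behind that estimate is simple enough to script when comparing K values. The helper names, the 8,192-token window, and the 1,500 reserved tokens below are illustrative assumptions, not product defaults.

```python
def context_tokens(final_top_k: int, chunk_tokens: int = 512) -> int:
    """Upper-bound estimate of retrieved-context tokens: K chunks of chunk_tokens each."""
    return final_top_k * chunk_tokens


def fits_budget(
    final_top_k: int,
    chunk_tokens: int,
    context_window: int,
    reserved_tokens: int,
) -> bool:
    """True if the retrieved context plus reserved tokens fits in the model's window.

    reserved_tokens covers the system prompt, user message, and expected answer.
    """
    return context_tokens(final_top_k, chunk_tokens) + reserved_tokens <= context_window


small = context_tokens(10)  # 10 * 512 = 5120
large = context_tokens(50)  # 50 * 512 = 25600

# Hypothetical 8,192-token model with 1,500 tokens reserved for everything else.
ok_at_10 = fits_budget(10, 512, context_window=8192, reserved_tokens=1500)
ok_at_50 = fits_budget(50, 512, context_window=8192, reserved_tokens=1500)
```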


Freshness floor

The minimum freshness multiplier old documents can decay to.

CoreCube applies an exponential freshness decay post-retrieval — recently synced content ranks higher, older content is demoted. Without a floor, a 5-year-old document may score ~0.05 × its semantic relevance, which effectively buries authoritative-but-old content.

The floor sets a lower bound on that multiplier, which caps the penalty:

| Value | Effect |
| --- | --- |
| 0 (default) | No floor. Decay runs unclamped, which preserves the original behavior. A document past its decay window can rank near-zero regardless of content. |
| 0.1 | Cap penalty at 10×. A 3-year-old document scores no worse than relevance × 0.1: still demoted, never silenced. |
| 0.2–0.3 (recommended for archive-heavy corpora) | Mild decay. Freshness still tiebreaks but old authoritative docs (policies, ADRs, legal) stay competitive. |
| 0.5 | Decay barely affects rank. Use when you mostly want freshness as a tiebreaker, not a sorter. |
| 1.0 | Disables decay entirely. Freshness no longer affects scoring. |

Set above 0 when your corpus has long-lived authoritative content that should not be demoted just because nobody re-synced it in 18 months — architectural decisions, compliance docs, design rationale, contracts.
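As a rough sketch of how a floor interacts with exponential decay, assume a half-life form of the decay curve. The 365-day half-life and the function names are assumptions chosen for illustration; this page does not specify the actual decay rate or formula CoreCube uses.

```python
def freshness_multiplier(
    age_days: float,
    floor: float = 0.0,
    half_life_days: float = 365.0,  # assumed value, not a documented default
) -> float:
    """Exponential decay clamped from below by the freshness floor."""
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must be between 0 and 1")
    decay = 0.5 ** (age_days / half_life_days)
    # The floor is a lower bound: the multiplier never drops below it.
    return max(floor, decay)


def adjusted_score(relevance: float, age_days: float, floor: float = 0.0) -> float:
    """Relevance scaled by the (floored) freshness multiplier."""
    return relevance * freshness_multiplier(age_days, floor)


ten_years = 10 * 365
unclamped = freshness_multiplier(ten_years, floor=0.0)  # about 0.001
floored = freshness_multiplier(ten_years, floor=0.1)    # 0.1
disabled = freshness_multiplier(ten_years, floor=1.0)   # 1.0, decay has no effect
```

Because the decay value never exceeds 1, a floor of 1.0 pins the multiplier at 1 for every document, which matches the "disables decay" row in the table.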


Setting reference

When defaults differ, values are listed as default-fast / default-balanced / default-accurate.

| Setting key | Type | Default | Range |
| --- | --- | --- | --- |
| reranker_enabled | boolean | true | |
| reranker_model | string (catalog id) | retrieval tool default | |
| retrieval_rerank_candidates | integer | 30 / 50 / 120 | 1–500 |
| retrieval_final_top_k | integer | 4 / 7 / 12 | 1–500 |
| retrieval_freshness_floor | float | 0 | 0–1 |
