Reranking
Reranking is the final ranking pass. After retrieval produces a hybrid candidate list, a cross-encoder model re-scores those candidates by reading each (query, chunk) pair as a single input — rather than comparing two independent embeddings.
Cross-encoders are slower: each candidate needs its own forward pass, whereas embedding comparison encodes the query once and reuses precomputed chunk vectors. They are also materially more accurate. A reranker such as bge-reranker or cohere-rerank typically lifts top-K precision by 10–25% over hybrid alone. For chat and Q&A workloads, the latency cost is almost always worth it.
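As a rough illustration of the per-candidate cost described above, the sketch below re-scores each (query, chunk) pair with a pluggable scorer. The names `rerank`, `PairScorer`, and `toy_overlap_scorer` are hypothetical, and the toy scorer only counts shared terms; it stands in for a real cross-encoder and is not how CoreCube scores pairs.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that reads a (query, chunk) pair jointly
# and returns a relevance score. A real cross-encoder model would sit here.
PairScorer = Callable[[str, str], float]


def rerank(query: str, chunks: List[str], score_pair: PairScorer) -> List[Tuple[str, float]]:
    """Re-score every candidate against the query and sort best-first.

    There is one scorer call per candidate, which is why cost grows with
    the number of candidates.
    """
    scored = [(chunk, score_pair(query, chunk)) for chunk in chunks]
    # sorted() is stable, so tied candidates keep their incoming order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_overlap_scorer(query: str, chunk: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query terms found in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    "Quarterly revenue grew in the EMEA region",
    "How to rotate API keys safely",
    "Rotate keys every 90 days per policy",
]
ranked = rerank("rotate api keys", candidates, toy_overlap_scorer)
```

Swapping `toy_overlap_scorer` for a function that calls an actual reranker model would leave `rerank` itself unchanged.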
This page also covers two ranking-stage decisions that aren't reranker-specific: how many results to return and how aggressively to penalize stale documents.
Reranker enabled
Whether to run the reranker at all.
| Setting | When to use |
|---|---|
| On (recommended) | Chat, Q&A, search-style UX where ranking quality matters and an extra 50–500 ms per query is acceptable. |
| Off | Latency-critical lookup APIs, bulk export jobs, evaluation harnesses where you want to measure raw retrieval quality without the rerank layer. |
Disabling the reranker means the Final top-K is taken straight from the hybrid candidate list ranked by RRF score — no second-pass refinement.
When off, the Reranker model and Rerank pool fields are inert (still stored but not read by the runtime).
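The toggle's effect on the result list can be sketched as follows. This is a minimal illustration, not the runtime's code: `rrf_fuse` and `final_results` are hypothetical names, and the RRF constant `k = 60` is a commonly used value assumed here, since this page does not state the one in use.

```python
from typing import Callable, Dict, List, Optional


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score.

    k = 60 is an assumed, conventional constant.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def final_results(
    fused: List[str],
    final_top_k: int,
    reranker_enabled: bool,
    rerank_fn: Optional[Callable[[List[str]], List[str]]] = None,
    rerank_pool: int = 30,
) -> List[str]:
    """Return the final top-K, with or without a second ranking pass."""
    if not reranker_enabled:
        # Reranker off: the fused order is final and the pool size is never read.
        return fused[:final_top_k]
    if rerank_fn is None:
        raise ValueError("reranker_enabled is True but no rerank_fn was supplied")
    # Reranker on: re-order the first `rerank_pool` candidates, then cut to K.
    return rerank_fn(fused[:rerank_pool])[:final_top_k]


dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = rrf_fuse([dense_hits, sparse_hits])

without_rerank = final_results(fused, final_top_k=2, reranker_enabled=False)
# A reversing function stands in for a real reranker, purely to show the order can change.
with_rerank = final_results(
    fused, final_top_k=2, reranker_enabled=True,
    rerank_fn=lambda pool: list(reversed(pool)),
)
```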
Reranker model
The cross-encoder used to re-score candidates.
| Class | Examples | Notes |
|---|---|---|
| Local | bge-reranker-v2-m3, qwen3-reranker-4b, Qwen/Qwen3-Reranker-0.6B, bge-reranker-large, mxbai-rerank-base-v1 | No outbound calls. ~50–200 ms per batch on a modern GPU; ~500 ms+ on CPU. Multilingual variants exist. |
| Cloud | cohere-rerank-3.5, voyage-rerank-2 | Best-in-class quality and latency (~50–100 ms typical). Subject to provider rate limits and per-call cost. |
The picker is disabled when Reranker enabled is off — there is nothing to rank with.
Switching the model has no ingestion-side cost. Reranking happens at query time only, so changing the model takes effect on the next query — no re-embed, no rebuild.
Rerank pool
The number of candidates fed to the reranker per query.
The hybrid stage produces up to dense_top_k + sparse_top_k distinct candidates (typically 30–60). This pool is the first N of that fused list that the reranker actually scores.
| Pool size | Effect |
|---|---|
| < Final top-K | Invalid. The reranker has fewer candidates than you want returned; the form blocks this. |
| Final top-K × 2 | Cheap reranking. Works when retrieval already does most of the ranking work and you mostly want a final tiebreaker. |
| 30–100 (typical) | The sweet spot. The reranker has room to promote chunks that ranked mid-list out of hybrid, which is its main value-add. |
| > 200 | Diminishing returns and rising latency. Each extra candidate is one more cross-encoder forward pass. |
Constraint: retrieval_rerank_candidates ≥ retrieval_final_top_k. Otherwise the reranker would score fewer candidates than the K results you ask for.
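A minimal sketch of the pool selection and its constraint, assuming the fused list is already ordered best-first. The function name `select_rerank_pool` is hypothetical; the product enforces the constraint in the settings form, not necessarily in code shaped like this.

```python
from typing import List


def select_rerank_pool(fused: List[str], rerank_candidates: int, final_top_k: int) -> List[str]:
    """Return the first N fused candidates that the reranker will score."""
    if rerank_candidates < final_top_k:
        raise ValueError(
            f"rerank pool ({rerank_candidates}) must be >= final top-K ({final_top_k})"
        )
    # Slicing tolerates a fused list shorter than the pool size.
    return fused[:rerank_candidates]


fused_ids = [f"chunk-{i}" for i in range(1, 61)]  # 60 fused candidates
pool = select_rerank_pool(fused_ids, rerank_candidates=30, final_top_k=7)
```

Because the pool is a prefix of the fused list, a chunk ranked below position N out of hybrid can never be promoted by the reranker, which is the trade-off the table above describes.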
Final top-K
The number of chunks returned to the caller after all ranking is done.
This is the number of chunks the LLM sees: every chunk ranked in the top K is included in the prompt context. Each chunk costs prompt tokens, so prompt cost scales with K.
| Workload | Typical K | Why |
|---|---|---|
| Chat / Q&A | 5–10 | The LLM needs enough context to answer but not so much that it gets distracted. |
| Citation-heavy answers | 8–15 | More citations = more confidence, at a higher prompt-token cost. |
| Broad research / summarization | 20–50 | Maximize coverage of the topic; the LLM does the synthesis. |
| Headless /v1/search | depends on caller | The API just returns the chunks; the consuming app decides what to do with them. |
Token budget reality check: at K = 10 with 512-token chunks, you are passing ~5,000 tokens of context per request before the system prompt or user message. At K = 50 you are at ~25,000 tokens — only a problem if your model has a tight context window or you are cost-sensitive.
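The arithmetic behind that check can be sketched in a few lines. The helper names are hypothetical, the 512-token chunk size comes from the example above, and the context window and reserved-token figures in the usage are illustrative assumptions only.

```python
def context_tokens(final_top_k: int, chunk_tokens: int = 512) -> int:
    """Tokens of retrieved context sent per request, before system prompt and user message."""
    return final_top_k * chunk_tokens


def max_top_k(context_window: int, reserved_tokens: int, chunk_tokens: int = 512) -> int:
    """Largest K that fits once `reserved_tokens` are set aside for everything else."""
    return max(0, (context_window - reserved_tokens) // chunk_tokens)


tokens_at_10 = context_tokens(10)  # 5120
tokens_at_50 = context_tokens(50)  # 25600
# Assumed figures: an 8,192-token window with 2,048 tokens reserved for prompt and answer.
k_budget = max_top_k(context_window=8192, reserved_tokens=2048)  # 12
```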
Freshness floor
The minimum freshness multiplier old documents can decay to.
CoreCube applies an exponential freshness decay post-retrieval — recently synced content ranks higher, older content is demoted. Without a floor, a 5-year-old document may score ~0.05 × its semantic relevance, which effectively buries authoritative-but-old content.
The floor puts a lower bound on that multiplier, which caps the penalty:
| Value | Effect |
|---|---|
| 0 (default) | No floor. Decay runs unclamped, which preserves the original behavior. A document past its decay window can rank near-zero regardless of content. |
| 0.1 | Cap penalty at 10×. A 3-year-old document scores no worse than relevance × 0.1: still demoted, never silenced. |
| 0.2–0.3 (recommended for archive-heavy corpora) | Mild decay. Freshness still tiebreaks but old authoritative docs (policies, ADRs, legal) stay competitive. |
| 0.5 | Decay barely affects rank. Use when you mostly want freshness as a tiebreaker, not a sorter. |
| 1.0 | Disables decay entirely; freshness no longer affects scoring. |
Set above 0 when your corpus has long-lived authoritative content that should not be demoted just because nobody re-synced it in 18 months — architectural decisions, compliance docs, design rationale, contracts.
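The clamping behavior can be sketched as below. This assumes a half-life form of exponential decay; the actual decay curve and its parameters are not given on this page, so `half_life_days` and the function names are hypothetical and the numbers are illustrative only.

```python
def freshness_multiplier(age_days: float, half_life_days: float, floor: float = 0.0) -> float:
    """Exponential decay in (0, 1], clamped from below by `floor`.

    The half-life parameterization is an assumption made for this sketch.
    """
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must be between 0 and 1")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    decay = 0.5 ** (max(age_days, 0.0) / half_life_days)
    return max(floor, decay)


def adjusted_score(relevance: float, age_days: float, half_life_days: float, floor: float = 0.0) -> float:
    """Relevance after the freshness multiplier is applied."""
    return relevance * freshness_multiplier(age_days, half_life_days, floor)


five_years = 5 * 365
unclamped = freshness_multiplier(five_years, half_life_days=365)           # 0.5 ** 5
floored = freshness_multiplier(five_years, half_life_days=365, floor=0.2)  # held at the floor
disabled = freshness_multiplier(five_years, half_life_days=365, floor=1.0) # decay has no effect
```

Since the decay value never exceeds 1, a floor of 1.0 always wins the `max`, which matches the "disables decay entirely" row in the table.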
Setting reference
When defaults differ, values are listed as default-fast / default-balanced / default-accurate.
| Setting key | Type | Default | Range |
|---|---|---|---|
| reranker_enabled | boolean | true | — |
| reranker_model | string (catalog id) | retrieval tool default | — |
| retrieval_rerank_candidates | integer | 30 / 50 / 120 | 1 – 500 |
| retrieval_final_top_k | integer | 4 / 7 / 12 | 1 – 500 |
| retrieval_freshness_floor | float | 0 | 0 – 1 |
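For readers scripting against these settings, the table can be mirrored as plain data. This is a sketch only: the defaults and ranges are copied from the table, but treating default-fast, default-balanced, and default-accurate as lookup keys, and the `resolve_settings` helper itself, are assumptions. It checks per-key ranges only.

```python
from typing import Any, Dict, Optional

# Defaults that differ by profile (values from the table above).
PROFILE_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "default-fast": {"retrieval_rerank_candidates": 30, "retrieval_final_top_k": 4},
    "default-balanced": {"retrieval_rerank_candidates": 50, "retrieval_final_top_k": 7},
    "default-accurate": {"retrieval_rerank_candidates": 120, "retrieval_final_top_k": 12},
}

# Defaults shared by every profile.
SHARED_DEFAULTS: Dict[str, Any] = {
    "reranker_enabled": True,
    "retrieval_freshness_floor": 0.0,
}

# Inclusive ranges from the table.
RANGES = {
    "retrieval_rerank_candidates": (1, 500),
    "retrieval_final_top_k": (1, 500),
    "retrieval_freshness_floor": (0.0, 1.0),
}


def resolve_settings(profile: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge shared defaults, profile defaults, and overrides, then range-check."""
    settings = {**SHARED_DEFAULTS, **PROFILE_DEFAULTS[profile], **(overrides or {})}
    for key, (low, high) in RANGES.items():
        if not low <= settings[key] <= high:
            raise ValueError(f"{key}={settings[key]} is outside {low}..{high}")
    return settings


balanced = resolve_settings("default-balanced")
archive_heavy = resolve_settings("default-accurate", {"retrieval_freshness_floor": 0.25})
```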
Related
- Retrieval — the candidate generator that feeds this stage.
- Embedding — what determines what "similar" means in the candidate stage.
- How retrieval works — the end-to-end flow.