Reranking

Reranking is the final ranking pass. After retrieval produces a hybrid candidate list, a cross-encoder model re-scores those candidates by reading each (query, chunk) pair as a single input — rather than comparing two independent embeddings.

Cross-encoders are slower than embedding comparison (one forward pass per candidate, versus one query embedding compared against precomputed chunk embeddings) but materially more accurate. A bge-reranker or cohere-rerank typically lifts top-K precision by 10–25% over hybrid alone. For chat and Q&A workloads, the latency cost is almost always worth it.
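To make the (query, chunk) scoring loop concrete, here is a minimal sketch of a second-pass rerank. It is not CoreCube's implementation: `rerank`, `toy_overlap_score`, and the sample chunks are hypothetical names, and the word-overlap scorer is only a stand-in for a real cross-encoder so that the example runs without a model download.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pair: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair jointly, then sort by descending score."""
    # One scorer call per candidate: this is why cost grows with the pool size.
    scored = [(chunk, score_pair(query, chunk)) for chunk in chunks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_overlap_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


candidates = [
    "Billing runs on the first day of each month",
    "Open account settings to reset your password",
    "Password rules require twelve characters",
]
ranked = rerank("how do I reset my password", candidates, toy_overlap_score)
for chunk, score in ranked:
    print(f"{score:.2f}  {chunk}")
```

In a real deployment `score_pair` would wrap the configured reranker model; the surrounding loop and sort stay the same shape.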

This page also covers two ranking-stage decisions that aren't reranker-specific: how many results to return and how aggressively to penalize stale documents.

Reranker enabled

Whether to run the reranker at all.

Setting	When to use
On (recommended)	Chat, Q&A, search-style UX where ranking quality matters and an extra 50–500 ms per query is acceptable.
Off	Latency-critical lookup APIs, bulk export jobs, evaluation harnesses where you want to measure raw retrieval quality without the rerank layer.

Disabling the reranker means the Final top-K is taken straight from the hybrid candidate list ranked by RRF score — no second-pass refinement.
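As an illustration of the reranker-off path, the sketch below fuses a dense and a sparse ranking with reciprocal rank fusion and cuts the result at K. The function name `rrf_fuse`, the sample ids, and the constant `k=60` (a commonly used RRF value) are assumptions for the example, not documented CoreCube internals.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # hypothetical dense (embedding) ranking
sparse_hits = ["c", "a", "d"]  # hypothetical sparse (keyword) ranking

fused = rrf_fuse([dense_hits, sparse_hits])
final_top_k = 2
results = [doc_id for doc_id, _ in fused[:final_top_k]]  # no second pass
print(results)
```

Documents that appear in both lists ("a" and "c" here) accumulate two contributions, which is why they outrank documents found by only one retriever.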

When off, the Reranker model and Rerank pool fields are inert (still stored but not read by the runtime).

Reranker model

The cross-encoder used to re-score candidates.

Class	Examples	Notes
Local	`bge-reranker-v2-m3`, `qwen3-reranker-4b`, `Qwen/Qwen3-Reranker-0.6B`, `bge-reranker-large`, `mxbai-rerank-base-v1`	No outbound calls. ~50–200 ms per batch on a modern GPU; ~500 ms+ on CPU. Multilingual variants exist.
Cloud	`cohere-rerank-3.5`, `voyage-rerank-2`	Best-in-class quality and latency (~50–100 ms typical). Subject to provider rate limits and per-call cost.

The picker is disabled when Reranker enabled is off — there is nothing to rank with.

Switching the model has no ingestion-side cost. Reranking happens at query time only, so changing the model takes effect on the next query — no re-embed, no rebuild.

Rerank pool

The number of candidates fed to the reranker per query.

The hybrid stage produces up to dense_top_k + sparse_top_k distinct candidates (typically 30–60). This pool is the first N of that fused list that the reranker actually scores.

Pool size	Effect
`< Final top-K`	Invalid. The reranker has fewer candidates than you want returned — the form blocks this.
`Final top-K × 2`	Cheap reranking. Works when retrieval already does most of the ranking work and you mostly want a final tiebreaker.
`30–100` (typical)	The sweet spot. The reranker has room to promote chunks that ranked mid-list out of hybrid, which is its main value-add.
`> 200`	Diminishing returns and rising latency. Each extra candidate is one more cross-encoder forward pass.

Constraint: retrieval_rerank_candidates ≥ retrieval_final_top_k. Otherwise the reranker would score fewer candidates than the K results you asked for.
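The effect of the pool size can be sketched as follows. This assumes the pool is a simple prefix slice of the fused list, as described above; `select_final`, the candidate ids, and the score table are hypothetical and stand in for a real reranker.

```python
from typing import Callable


def select_final(
    fused: list[str],
    rerank_candidates: int,
    final_top_k: int,
    score: Callable[[str], float],
) -> list[str]:
    """Rerank the first N fused candidates and return the best K of them."""
    if rerank_candidates < final_top_k:
        raise ValueError("rerank pool must be at least as large as final top-K")
    pool = fused[:rerank_candidates]                   # only these get scored
    reranked = sorted(pool, key=score, reverse=True)   # one score per pooled candidate
    return reranked[:final_top_k]


fused_list = ["c1", "c2", "c3", "c4", "c5", "c6"]  # hybrid order, best first
# Hypothetical reranker scores: "c5" is the best match but sits mid-list.
reranker_scores = {"c1": 0.2, "c2": 0.9, "c3": 0.1, "c4": 0.4, "c5": 0.95, "c6": 0.3}

small_pool = select_final(fused_list, 4, 2, reranker_scores.get)
large_pool = select_final(fused_list, 6, 2, reranker_scores.get)
print(small_pool, large_pool)
```

With a pool of 4 the reranker never sees "c5", so it cannot promote it; with a pool of 6 it does. That is the trade-off the table describes: a larger pool gives more chances to rescue a mid-list chunk at the cost of more forward passes.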

Final top-K

The number of chunks returned to the caller after all ranking is done.

This is the number of chunks the LLM sees: every chunk ranked within the top K is included in the prompt context. Each chunk costs prompt tokens, so prompt cost scales with K.

Workload	Typical K	Why
Chat / Q&A	5–10	The LLM needs enough context to answer but not so much that it gets distracted.
Citation-heavy answers	8–15	More citations = more confidence, at a higher prompt-token cost.
Broad research / summarization	20–50	Maximize coverage of the topic; the LLM does the synthesis.
Headless `/v1/search`	depends on caller	The API just returns the chunks; the consuming app decides what to do with them.

Token budget reality check: at K = 10 with 512-token chunks, you are passing ~5,000 tokens of context per request before the system prompt or user message. At K = 50 you are at ~25,000 tokens — only a problem if your model has a tight context window or you are cost-sensitive.
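The arithmetic behind that check is easy to script. The sketch below assumes fixed 512-token chunks; `context_tokens`, `max_top_k`, and the 8,192-token window with 2,000 reserved tokens are illustrative values, not product defaults.

```python
def context_tokens(final_top_k: int, chunk_tokens: int = 512) -> int:
    """Upper bound on retrieved-context tokens for a given K."""
    return final_top_k * chunk_tokens


def max_top_k(context_window: int, reserved_tokens: int, chunk_tokens: int = 512) -> int:
    """Largest K that fits after reserving room for the system prompt, user message, and answer."""
    return max((context_window - reserved_tokens) // chunk_tokens, 0)


print(context_tokens(10))      # 5120
print(context_tokens(50))      # 25600
print(max_top_k(8192, 2000))   # 12
```

Real chunks are often shorter than the configured maximum, so these figures are a ceiling rather than an exact count.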

Freshness floor

The minimum freshness multiplier old documents can decay to.

CoreCube applies an exponential freshness decay post-retrieval — recently synced content ranks higher, older content is demoted. Without a floor, a 5-year-old document may score ~0.05 × its semantic relevance, which effectively buries authoritative-but-old content.

The floor puts a lower bound on that multiplier, which limits how large the penalty can get:

Value	Effect
`0` (default)	No floor. Decay runs unclamped — preserves the original behavior. A document past its decay window can rank near-zero regardless of content.
`0.1`	Cap penalty at 10×. A 3-year-old document scores no worse than `relevance × 0.1` — still demoted, never silenced.
`0.2–0.3` (recommended for archive-heavy corpora)	Mild decay. Freshness still tiebreaks but old authoritative docs (policies, ADRs, legal) stay competitive.
`0.5`	Decay barely affects rank. Use when you mostly want freshness as a tiebreaker, not a sorter.
`1.0`	Disables decay entirely — freshness no longer affects scoring.

Set above 0 when your corpus has long-lived authoritative content that should not be demoted just because nobody re-synced it in 18 months — architectural decisions, compliance docs, design rationale, contracts.
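One way to picture the floor is as a clamp on the decay multiplier. The sketch below is an assumption-laden illustration: it models decay with a half-life and applies the floor with `max`, which matches the behavior described in the table, but the actual decay curve and its parameters are not specified on this page. `freshness_multiplier` and `half_life_days` are hypothetical names.

```python
def freshness_multiplier(age_days: float, half_life_days: float, floor: float = 0.0) -> float:
    """Exponential decay by document age, clamped so it never drops below the floor."""
    decay = 0.5 ** (age_days / half_life_days)  # 1.0 when fresh, halves every half-life
    return max(floor, decay)


relevance = 0.8        # hypothetical semantic relevance score
age = 5 * 365          # a five-year-old document
half_life = 365        # assumed half-life, for illustration only

for floor in (0.0, 0.1, 0.3, 1.0):
    multiplier = freshness_multiplier(age, half_life, floor)
    print(f"floor={floor}: multiplier={multiplier:.5f}, final score={relevance * multiplier:.5f}")
```

With no floor the old document keeps about 3% of its relevance under these assumed numbers; with a floor of 0.3 it keeps 30%, and with a floor of 1.0 age has no effect at all.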

Setting reference

Where a setting has more than one default, the values are listed in the order default-fast / default-balanced / default-accurate.

Setting key	Type	Default	Range
`reranker_enabled`	boolean	`true`	—
`reranker_model`	string (catalog id)	retrieval tool default	—
`retrieval_rerank_candidates`	integer	`30` / `50` / `120`	`1` – `500`
`retrieval_final_top_k`	integer	`4` / `7` / `12`	`1` – `500`
`retrieval_freshness_floor`	float	`0`	`0` – `1`
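The ranges in the table, together with the pool-versus-K constraint, can be checked before a configuration is saved. The following is a hypothetical validator, not part of any CoreCube API; only the setting keys, ranges, and default values come from the table above.

```python
# Numeric ranges copied from the setting reference table.
RANGES = {
    "retrieval_rerank_candidates": (1, 500),
    "retrieval_final_top_k": (1, 500),
    "retrieval_freshness_floor": (0.0, 1.0),
}


def validate_settings(settings: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the settings pass."""
    errors = []
    for key, (low, high) in RANGES.items():
        if not low <= settings[key] <= high:
            errors.append(f"{key}={settings[key]} is outside {low} to {high}")
    if settings["retrieval_rerank_candidates"] < settings["retrieval_final_top_k"]:
        errors.append("retrieval_rerank_candidates must be >= retrieval_final_top_k")
    return errors


balanced = {
    "reranker_enabled": True,
    "retrieval_rerank_candidates": 50,
    "retrieval_final_top_k": 7,
    "retrieval_freshness_floor": 0.0,
}
too_small_pool = dict(balanced, retrieval_rerank_candidates=5, retrieval_final_top_k=10)

print(validate_settings(balanced))
print(validate_settings(too_small_pool))
```

The first call returns an empty list; the second reports the pool constraint, mirroring what the settings form blocks.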

Related

Retrieval: the candidate generator that feeds this stage.
Embedding: what determines what "similar" means in the candidate stage.
How retrieval works: the end-to-end flow.
