Retrieval Pipeline
CoreCube's retrieval pipeline transforms a user query into a grounded, cited answer by composing the active preset pipeline, running scoped hybrid search, applying query/chunk tools, and assembling context before passing anything to an LLM.
Pipeline overview
Preset resolution (step 0)
Every query runs through the active preset binding:
- Resolve the caller's organization or compartment preset.
- Load the binding's resolved pipeline snapshot.
- Read the preset answer model and tool-loop bounds.
- Use the preset's default retrieval tool as the source of query-time retrieval settings.
The snapshot contains query tools, default retrieval, optional LLM-callable retrieval/action capabilities, chunk tools, and prompt fragments grouped by stage. This is why two presets can use the same corpus but behave differently: the retrieval profile, tool chain, answer model, and prompt composition are all part of the preset.
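As a rough illustration of what such a snapshot might hold, the sketch below models it as an immutable record. The class name, field names, binding keys, and the rule that a compartment binding takes precedence over the organization binding are all assumptions made for this example, not CoreCube's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class PipelineSnapshot:
    """Hypothetical shape of a binding's resolved pipeline snapshot."""
    answer_model: Optional[str]           # None means "fall back to the default LLM provider"
    max_tool_iterations: int              # tool-loop bound
    default_retrieval: Dict[str, object]  # query-time retrieval settings
    query_tools: Tuple[str, ...] = ()
    chunk_tools: Tuple[str, ...] = ()
    # Prompt fragments grouped by stage, e.g. {"answer": ("...",)}
    prompt_fragments: Dict[str, Tuple[str, ...]] = field(default_factory=dict)


def resolve_snapshot(bindings, org_id, compartment_id=None):
    """Assumed precedence: a compartment binding wins over the organization binding."""
    if compartment_id is not None and ("compartment", compartment_id) in bindings:
        return bindings[("compartment", compartment_id)]
    return bindings[("organization", org_id)]
```

Two bindings that point at the same corpus can then differ only in the snapshot they carry, which is what makes their behavior diverge.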
Query processing (step 1)
When a query arrives at /v1/chat/completions:
- Admission control — API-key rate limits, chat concurrency limits, and provider catalog limits protect the server from overload.
- Query extraction — The last user message is extracted for retrieval; full conversation history still goes to the answer LLM.
- Scope resolution — API key → effective user → allowed connections/scopes (compartments + sensitivity levels).
- Query tools — Active query tools can rewrite the query into semantic and keyword variants.
- Query embedding — The query text is embedded using the same model family as document chunks.
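The query extraction step can be sketched as follows, assuming OpenAI-style message dictionaries with role and content keys. The function name is hypothetical, and the handling of multi-part content is a guess at reasonable behavior rather than documented CoreCube behavior.

```python
def extract_retrieval_query(messages):
    """Return the text of the last user message, or None if there is none.

    The full `messages` list is left untouched so it can still be sent
    to the answer LLM.
    """
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, list):
            # Multi-part content: keep only the text parts (assumed behavior).
            return " ".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
        return content
    return None
```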
Hybrid search (step 2)
The default retrieval tool controls the dense pool, sparse pool, fusion weights, score floor,
freshness floor, HNSW ef_search, reranker model, rerank pool, and final top-K. Two search legs run
in parallel and are fused:
Full-text search uses PostgreSQL's websearch_to_tsquery with weighted tsvectors:
- Weight A — document title (highest signal)
- Weight B — heading path (e.g., "Deployment > Staging > Prerequisites")
- Weight C — chunk body content
Vector search uses pgvector's HNSW index for approximate nearest neighbor with cosine distance.
RRF fusion combines both ranked lists. Documents appearing in both lists are ranked higher than
documents in only one list. The default presets use rrf_k = 60, with vector and FTS stream weights
stored on the retrieval tool.
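A minimal sketch of weighted reciprocal rank fusion is shown below. It uses the standard RRF term 1 / (rrf_k + rank) and multiplies each stream's contribution by its weight. How CoreCube actually applies the stream weights is not stated here, so the weighting scheme, function name, and parameter names are assumptions.

```python
def rrf_fuse(vector_ids, fts_ids, rrf_k=60, vector_weight=1.0, fts_weight=1.0):
    """Fuse two ranked lists of chunk ids into one list of (chunk_id, score).

    Each list contributes weight / (rrf_k + rank) per chunk, with ranks
    starting at 1. Chunks found by both legs accumulate both contributions.
    """
    scores = {}
    for weight, ranked in ((vector_weight, vector_ids), (fts_weight, fts_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first; chunk id breaks exact ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Chunk "b" appears in both lists, so it outranks "a" even though "a" was the top vector hit.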
Ranking and filtering (step 3)
The fused candidate list passes through ranking and filtering:
| Stage | Description |
|---|---|
| Freshness decay | Exponential decay based on time since last sync. Configurable half-life. Recently synced content ranks higher. |
| Quality filter | Exclude boilerplate chunks (navigation menus, footers, repeated disclaimers) below a quality score threshold |
| ACL filter | Enforce scope — remove any chunks from connections outside the query's allowed compartments/sensitivity |
| Cross-encoder reranker | Score each (query, chunk) pair jointly. Candidate pool, final top-K, enabled flag, and model come from the retrieval tool. Graceful fallback to fusion ranking when reranking is disabled or unavailable. |
| Chunk tools | Optional preset tools can re-score, filter, or expand the candidate list before answer generation. |
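The freshness decay stage can be illustrated with a half-life formula: a chunk's freshness factor halves every half_life_days since its last sync. The 30-day default and the function name are placeholders, and multiplying the factor into a relevance score is only one plausible way to apply it.

```python
from datetime import datetime, timedelta, timezone


def freshness_factor(last_synced_at, now, half_life_days=30.0):
    """Exponential decay: 1.0 when just synced, 0.5 after one half-life."""
    age_seconds = max((now - last_synced_at).total_seconds(), 0.0)
    age_days = age_seconds / 86400.0
    return 0.5 ** (age_days / half_life_days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
factor = freshness_factor(now - timedelta(days=30), now)
print(factor)  # 0.5
```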
Hybrid ranking weights
| Signal | Default source |
|---|---|
| Vector similarity | Retrieval tool dense stream |
| Full-text relevance | Retrieval tool sparse stream |
| Freshness boost | Retrieval tool + freshness_decay chunk tool |
When scores are equal (within 0.01 tolerance), source trust level breaks the tie: authoritative > reference > volatile.
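One way to read the tie-break rule is sketched below: sort by score, group neighbors whose scores fall within the tolerance of the group's highest score, and order each group by trust level. The grouping strategy, the dictionary keys, and the function name are assumptions. The source states only the tolerance and the trust ordering.

```python
TRUST_RANK = {"authoritative": 0, "reference": 1, "volatile": 2}


def apply_trust_tiebreak(chunks, tolerance=0.01):
    """Order chunks by score, letting trust level decide near-equal scores."""
    ordered = sorted(chunks, key=lambda c: c["score"], reverse=True)
    result, group = [], []
    for chunk in ordered:
        # Close the current group once a score drops out of tolerance.
        if group and group[0]["score"] - chunk["score"] > tolerance:
            result.extend(sorted(group, key=lambda c: TRUST_RANK[c["trust"]]))
            group = []
        group.append(chunk)
    # sorted() is stable, so equal trust levels keep their score order.
    result.extend(sorted(group, key=lambda c: TRUST_RANK[c["trust"]]))
    return result
```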
Context assembly (step 4)
Context is formatted as:
### [1] Deployment Runbook — Confluence — Engineering
...chunk content...
### [2] Incident Response Guide — Confluence — Engineering
...chunk content...
The numbered headers map directly to citation references in the response.
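A formatter that produces headers in this shape might look like the sketch below. The dictionary keys (title, source, compartment, content) are guesses at what the three header fields represent, inferred from the example headers above.

```python
SEPARATOR = " \u2014 "  # the dash separator used in the header format


def format_context(chunks):
    """Render chunks as numbered blocks; numbering starts at 1."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        label = SEPARATOR.join((chunk["title"], chunk["source"], chunk["compartment"]))
        blocks.append("### [{}] {}\n{}".format(number, label, chunk["content"]))
    return "\n".join(blocks)
```

Because the number is assigned at formatting time, the same list order must be retained for citation lookup later.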
LLM generation (step 5)
- Prompt assembly — Answer-stage fragments + active tool prompts + invariant footer + formatted context + conversation history.
- Answer model selection — The active preset's answer model is used when set; otherwise the default LLM provider is selected.
- Tool-call loop — Non-streaming requests can let the answer LLM call retrieval or action capabilities attached to the preset.
- Streaming — SSE chunks in OpenAI format are forwarded to the client. Extended streaming adds a cube.metadata frame before [DONE].
- Citation mapping — [N] references in the response text map to the numbered context chunks.
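Citation mapping can be sketched as a regular-expression pass over the response text. The function name is hypothetical, and ignoring out-of-range numbers is an assumed policy, not documented behavior.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def map_citations(answer_text, context_chunks):
    """Map each [N] found in the answer to the N-th context chunk (1-based).

    Numbers with no matching chunk are skipped; repeated citations are
    recorded once, in order of first appearance.
    """
    cited = {}
    for match in CITATION_PATTERN.finditer(answer_text):
        number = int(match.group(1))
        if 1 <= number <= len(context_chunks):
            cited.setdefault(number, context_chunks[number - 1])
    return cited
```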
Ingestion pipeline
The ingestion pipeline moves documents from source systems into the evidence layer.
Ingestion is idempotent. Changed documents replace chunks through an atomic chunk replacement path, unchanged external documents can repair missing chunk state, and cross-source content deduplication does not reuse another connection's storage key.
Chunking strategies
| Content type | Strategy | Behavior |
|---|---|---|
| Markdown / docs | Heading-aware | Split at H1/H2/H3 boundaries, preserve section hierarchy |
| Code files | Code-aware | Split at function/class boundaries, keep imports with first chunk |
| HTML pages | Heading-aware | Extract to markdown first, then heading-based chunking |
| PDF | Paragraph-based | Paragraph boundaries (PDFs lack reliable heading structure) |
| Plain text | Fixed-size | Fixed-size with configurable overlap |
| JSON / structured | Record-based | Each top-level record or array item becomes a chunk |
| Tables | Intact | Tables kept whole when possible, headers repeated when split |
Default chunk size: 512 tokens target, 1024 token maximum. Overlap: 50 tokens.
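The fixed-size strategy with overlap can be sketched as a sliding window over a token list. The function name is hypothetical and the tokens here are any pre-tokenized sequence. The 1024-token maximum is not modeled, since it presumably matters only for structure-aware strategies whose sections can exceed the target.

```python
def chunk_tokens(tokens, target=512, overlap=50):
    """Split tokens into windows of `target` size that share `overlap` tokens."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    step = target - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])  # [512, 512, 76]
```

With the defaults, each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk reappear as the first 50 of the next.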
Query Explorer
Use the Query Explorer in the Admin Console to inspect the full pipeline for any query:
- Score breakdown per chunk (vector, FTS, freshness, combined)
- Active preset, retrieval tool, query tools, and chunk tools
- Side-by-side search mode comparison (vector-only vs FTS-only vs hybrid)
- Which connections were included or excluded by scope
- Context assembly trace (which chunks selected, filtered, deduplicated, budget-truncated)
- Answerability assessment with confidence level