Retrieval Pipeline
CoreCube's retrieval pipeline transforms a user query into a grounded, cited answer by combining hybrid search, reranking, and context assembly before passing anything to an LLM.
Pipeline overview
Query processing (step 1)
When a query arrives at /v1/chat/completions:
- Query extraction — The last user message is extracted (500-character cap for the retrieval query; the full message is still sent to the LLM)
- Scope resolution — API key → user → allowed connections (compartments + sensitivity levels)
- Query embedding — The query text is embedded using the same model as document chunks (CoreCube Inference or a cloud embedding API)
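A minimal sketch of the query-extraction step, assuming an OpenAI-style messages list (the function name and constant are illustrative, not CoreCube's actual API):

```python
MAX_QUERY_CHARS = 500  # retrieval-query cap; the full message still goes to the LLM

def build_retrieval_query(messages):
    """Extract the last user message and cap it for retrieval.

    Only the retrieval query is truncated; prompt assembly in step 5
    uses the untruncated conversation history.
    """
    last_user = next(m["content"] for m in reversed(messages)
                     if m["role"] == "user")
    return last_user[:MAX_QUERY_CHARS]
```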
Hybrid search (step 2)
Two search legs run in parallel and are fused:
Full-text search uses PostgreSQL's websearch_to_tsquery with weighted tsvectors:
- Weight A — document title (highest signal)
- Weight B — heading path (e.g., "Deployment > Staging > Prerequisites")
- Weight C — chunk body content
Vector search uses pgvector's HNSW index for approximate nearest neighbor with cosine distance.
RRF fusion (K=60) combines both ranked lists; documents that appear in both lists accumulate contributions from each leg and therefore rank higher than documents found by only one.
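The fusion step follows the standard Reciprocal Rank Fusion formula, score(d) = Σ 1/(K + rank of d in each list). A sketch, with assumed input shapes (ranked lists of chunk IDs):

```python
def rrf_fuse(fts_ranked, vector_ranked, k=60):
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; a document in
    both lists accumulates both contributions, so it outranks a
    document seen by only one search leg.
    """
    scores = {}
    for ranked in (fts_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```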
Ranking and filtering (step 3)
The fused candidate list passes through four sequential stages:
| Stage | Description |
|---|---|
| Freshness decay | Exponential decay based on time since last sync. Configurable half-life. Recently synced content ranks higher. |
| Quality filter | Exclude boilerplate chunks (navigation menus, footers, repeated disclaimers) below a quality score threshold |
| ACL filter | Enforce scope — remove any chunks from connections outside the query's allowed compartments/sensitivity |
| Cross-encoder reranker | Score each (query, chunk) pair jointly. Top-30 candidates → top-10. Graceful fallback to fusion ranking when no reranker is configured. |
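The freshness-decay stage is standard exponential half-life decay. A minimal sketch, assuming age and half-life are both expressed in days (the function name is hypothetical):

```python
def freshness_multiplier(age_days, half_life_days=30.0):
    """Decay factor applied to a chunk's score based on time since last sync.

    After one half-life the factor is 0.5, after two it is 0.25, and so on;
    freshly synced content (age 0) is unchanged.
    """
    return 0.5 ** (age_days / half_life_days)
```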
Hybrid ranking weights
| Signal | Default weight |
|---|---|
| Vector similarity | 60% |
| Full-text relevance | 30% |
| Freshness boost | 10% |
When scores are equal (within 0.01 tolerance), source trust level breaks the tie: authoritative > reference > volatile.
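Putting the default weights and the trust tie-break together, a hypothetical scoring key might look like the following (field names and the bucketing approach are assumptions, not CoreCube's internals):

```python
TRUST_ORDER = {"authoritative": 2, "reference": 1, "volatile": 0}

def combined_score(vector_sim, fts_rel, freshness):
    # Default weights: 60% vector, 30% full-text, 10% freshness.
    return 0.6 * vector_sim + 0.3 * fts_rel + 0.1 * freshness

def rank_key(chunk):
    """Sort key: bucket scores to the 0.01 tolerance, then break ties by trust."""
    score = combined_score(chunk["vector"], chunk["fts"], chunk["fresh"])
    return (round(score / 0.01), TRUST_ORDER[chunk["trust"]])
```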
Context assembly (step 4)
Context is formatted as:
### [1] Deployment Runbook — Confluence — Engineering
...chunk content...
### [2] Incident Response Guide — Confluence — Engineering
...chunk content...
The numbered headers map directly to citation references in the response.
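A sketch of the assembly format above, assuming chunk records carry title, source, and space fields (the field names are illustrative):

```python
def format_context(chunks):
    """Render chunks as numbered sections; the [i] headers anchor citations."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = f"### [{i}] {c['title']} — {c['source']} — {c['space']}"
        blocks.append(f"{header}\n{c['content']}")
    return "\n\n".join(blocks)
```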
LLM generation (step 5)
- Prompt assembly — System prompt + formatted context + citation instructions + conversation history
- LLM routing — Select provider based on routing rules (default provider, scope override, or user override)
- Streaming — SSE chunks in OpenAI format forwarded to the client
- Citation mapping — [N] references in the response text map to the numbered context chunks
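Citation mapping can be sketched as a scan for [N] markers, discarding any reference outside the context range (the function name is hypothetical):

```python
import re

def extract_citations(answer, num_chunks):
    """Return the sorted set of valid chunk indices cited in the answer text."""
    refs = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in refs if 1 <= n <= num_chunks)
```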
Ingestion pipeline
How documents get from source systems into the evidence layer:
Chunking strategies
| Content type | Strategy | Behavior |
|---|---|---|
| Markdown / docs | Heading-aware | Split at H1/H2/H3 boundaries, preserve section hierarchy |
| Code files | Code-aware | Split at function/class boundaries, keep imports with first chunk |
| HTML pages | Heading-aware | Extract to markdown first, then heading-based chunking |
| PDFs | Paragraph-based | Split at paragraph boundaries (PDFs lack reliable heading structure) |
| Plain text | Fixed-size | Sliding fixed-size windows with configurable overlap |
| JSON / structured | Record-based | Each top-level record or array item becomes a chunk |
| Tables | Intact | Tables kept whole when possible, headers repeated when split |
Default chunk size: 512 tokens target, 1024 token maximum. Overlap: 50 tokens.
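The fixed-size strategy with overlap can be sketched as a sliding window over a token list (tokenization is simplified to a pre-split list for illustration; the defaults mirror the 512-token target and 50-token overlap above):

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Each chunk starts `size - overlap` tokens after the previous one,
    so the last `overlap` tokens of a chunk repeat at the start of the next.
    """
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```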
Query Explorer
Use the Query Explorer in the Admin Console to inspect the full pipeline for any query:
- Score breakdown per chunk (vector, FTS, freshness, combined)
- Side-by-side search mode comparison (vector-only vs FTS-only vs hybrid)
- Which connections were included or excluded by scope
- Context assembly trace (which chunks were selected, filtered, deduplicated, or truncated to fit the context budget)
- Answerability assessment with confidence level