
Retrieval Pipeline

CoreCube's retrieval pipeline transforms a user query into a grounded, cited answer by combining hybrid search, reranking, and context assembly before passing anything to an LLM.

Pipeline overview

Query processing (step 1)

When a query arrives at /v1/chat/completions:

  1. Query extraction — The last user message is extracted (500-character cap for the retrieval query; the full message is still sent to the LLM)
  2. Scope resolution — API key → user → allowed connections (compartments + sensitivity levels)
  3. Query embedding — The query text is embedded using the same model as document chunks (CoreCube Inference or a cloud embedding API)
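A minimal sketch of the extraction step, assuming an OpenAI-style messages list (the function name and constant are illustrative, not CoreCube's actual code):

```python
MAX_QUERY_CHARS = 500  # retrieval-query cap; the full message still goes to the LLM

def build_retrieval_query(messages: list[dict]) -> str:
    """Extract the last user message and cap it for retrieval."""
    last_user = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"),
        "",
    )
    return last_user[:MAX_QUERY_CHARS]
```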

Hybrid search (step 2)

Two search legs run in parallel and are fused:

Full-text search uses PostgreSQL's websearch_to_tsquery with weighted tsvectors:

  • Weight A — document title (highest signal)
  • Weight B — heading path (e.g., "Deployment > Staging > Prerequisites")
  • Weight C — chunk body content

Vector search uses pgvector's HNSW index for approximate nearest neighbor with cosine distance.
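The two legs can be sketched as SQL held in Python string constants; the table and column names, parameter style, and `LIMIT 50` candidate count are assumptions, not the actual schema:

```python
# Full-text leg: weighted tsvectors (A = title, B = heading path, C = body)
# ranked against a websearch_to_tsquery of the user query.
FTS_LEG = """
SELECT chunk_id,
       ts_rank_cd(
         setweight(to_tsvector('english', doc_title),    'A') ||
         setweight(to_tsvector('english', heading_path), 'B') ||
         setweight(to_tsvector('english', body),         'C'),
         websearch_to_tsquery('english', %(query)s)
       ) AS score
FROM chunks
ORDER BY score DESC
LIMIT 50;
"""

# Vector leg: pgvector's <=> operator is cosine distance when the HNSW
# index is built with vector_cosine_ops, so similarity = 1 - distance.
VECTOR_LEG = """
SELECT chunk_id, 1 - (embedding <=> %(query_embedding)s) AS score
FROM chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT 50;
"""
```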

RRF fusion (K=60) combines both ranked lists. Documents appearing in both lists are ranked higher than documents in only one list.
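Reciprocal Rank Fusion scores each document as the sum of 1 / (K + rank) over the lists it appears in, which is why documents present in both legs outrank single-leg documents. A minimal sketch:

```python
def rrf_fuse(fts_ranked: list[str], vec_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked ID lists with RRF: score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Here "b" appears in both lists and wins despite never being ranked first in either leg.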

Ranking and filtering (step 3)

The fused candidate list passes through four sequential filters:

| Stage | Description |
| --- | --- |
| Freshness decay | Exponential decay based on time since last sync. Configurable half-life. Recently synced content ranks higher. |
| Quality filter | Excludes boilerplate chunks (navigation menus, footers, repeated disclaimers) below a quality score threshold. |
| ACL filter | Enforces scope: removes any chunks from connections outside the query's allowed compartments/sensitivity. |
| Cross-encoder reranker | Scores each (query, chunk) pair jointly, narrowing the top 30 candidates to the top 10. Falls back gracefully to fusion ranking when no reranker is configured. |
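The freshness-decay stage reduces to a half-life formula. In this sketch the 30-day default is an assumption; the docs only say the half-life is configurable:

```python
from datetime import datetime, timezone

def freshness_factor(last_synced: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay: the factor halves every half_life_days since last sync."""
    age_days = (datetime.now(timezone.utc) - last_synced).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)
```

Content synced just now scores ~1.0; content a half-life old scores ~0.5.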

Hybrid ranking weights

| Signal | Default weight |
| --- | --- |
| Vector similarity | 60% |
| Full-text relevance | 30% |
| Freshness boost | 10% |

When scores are equal (within 0.01 tolerance), source trust level breaks the tie: authoritative > reference > volatile.
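One way to realize the weighted combination and the trust tie-break; bucketing scores to the 0.01 tolerance is an implementation choice for the sketch, not necessarily what CoreCube does, and the signal values are assumed normalized to [0, 1]:

```python
TRUST_ORDER = {"authoritative": 2, "reference": 1, "volatile": 0}
TOLERANCE = 0.01

def combined_score(vector_sim: float, fts_rel: float, freshness: float) -> float:
    # Default 60/30/10 weights from the table above.
    return 0.6 * vector_sim + 0.3 * fts_rel + 0.1 * freshness

def rank_chunks(chunks: list[dict]) -> list[dict]:
    """Sort by combined score; within the tolerance, source trust breaks the tie."""
    def key(c: dict) -> tuple[int, int]:
        score = combined_score(c["vector"], c["fts"], c["freshness"])
        return (round(score / TOLERANCE), TRUST_ORDER[c["trust"]])
    return sorted(chunks, key=key, reverse=True)
```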

Context assembly (step 4)

Context is formatted as:

### [1] Deployment Runbook — Confluence — Engineering
...chunk content...

### [2] Incident Response Guide — Confluence — Engineering
...chunk content...

The numbered headers map directly to citation references in the response.
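A sketch of the assembly step; the chunk field names (title/source/space/content) are assumptions about the shape of the retrieved records:

```python
def format_context(chunks: list[dict]) -> str:
    """Render chunks with numbered headers that double as citation anchors."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"### [{i}] {chunk['title']} — {chunk['source']} — {chunk['space']}"
        blocks.append(f"{header}\n{chunk['content']}")
    return "\n\n".join(blocks)
```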

LLM generation (step 5)

  1. Prompt assembly — System prompt + formatted context + citation instructions + conversation history
  2. LLM routing — Select provider based on routing rules (default provider, scope override, or user override)
  3. Streaming — SSE chunks in OpenAI format forwarded to the client
  4. Citation mapping — [N] references in the response text map to the numbered context chunks
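The citation-mapping step can be sketched as a scan for [N] markers that keeps only numbers resolving to an actual context chunk; this is a simplification of whatever the real mapper does:

```python
import re

def extract_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited chunk numbers in first-appearance order, dropping
    out-of-range references and duplicates."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= num_chunks and n not in seen:
            seen.append(n)
    return seen
```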

Ingestion pipeline

How documents get from source systems into the evidence layer.

Chunking strategies

| Content type | Strategy | Behavior |
| --- | --- | --- |
| Markdown / docs | Heading-aware | Split at H1/H2/H3 boundaries, preserve section hierarchy |
| Code files | Code-aware | Split at function/class boundaries, keep imports with first chunk |
| HTML pages | Heading-aware | Extract to markdown first, then heading-based chunking |
| PDF | Paragraph-based | Paragraph boundaries (PDFs lack reliable heading structure) |
| Plain text | Fixed-size | Fixed-size with configurable overlap |
| JSON / structured | Record-based | Each top-level record or array item becomes a chunk |
| Tables | Intact | Tables kept whole when possible, headers repeated when split |

Default chunk size: 512-token target, 1024-token maximum. Overlap: 50 tokens.
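The fixed-size strategy with overlap can be sketched over a pre-tokenized input; tokenization itself is assumed to happen upstream:

```python
def fixed_size_chunks(
    tokens: list[str], target: int = 512, overlap: int = 50
) -> list[list[str]]:
    """Fixed-size chunking with overlap (the plain-text strategy above).

    Each chunk is `target` tokens; consecutive chunks share `overlap` tokens.
    """
    if target <= overlap:
        raise ValueError("target must exceed overlap")
    chunks, step = [], target - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break
    return chunks
```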

Query Explorer

Use the Query Explorer in the Admin Console to inspect the full pipeline for any query:

  • Score breakdown per chunk (vector, FTS, freshness, combined)
  • Side-by-side search mode comparison (vector-only vs FTS-only vs hybrid)
  • Which connections were included or excluded by scope
  • Context assembly trace (which chunks selected, filtered, deduplicated, budget-truncated)
  • Answerability assessment with confidence level
