Retrieval Pipeline

CoreCube's retrieval pipeline transforms a user query into a grounded, cited answer by composing the active preset pipeline, running scoped hybrid search, applying query/chunk tools, and assembling context before passing anything to an LLM.

Pipeline overview

Preset resolution (step 0)

Every query runs through the active preset binding:

  1. Resolve the caller's organization or compartment preset.
  2. Load the binding's resolved pipeline snapshot.
  3. Read the preset answer model and tool-loop bounds.
  4. Use the preset's default retrieval tool as the source of query-time retrieval settings.

The snapshot contains query tools, default retrieval, optional LLM-callable retrieval/action capabilities, chunk tools, and prompt fragments grouped by stage. This is why two presets can use the same corpus but behave differently: the retrieval profile, tool chain, answer model, and prompt composition are all part of the preset.
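The resolution steps above can be pictured with a small data structure. This is a minimal sketch, not CoreCube's actual schema: the `PipelineSnapshot` class, its field names, the `bindings` dictionary, and the rule that a compartment binding takes precedence over the organization binding are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PipelineSnapshot:
    """Hypothetical shape of a binding's resolved pipeline snapshot."""
    answer_model: Optional[str]        # None means "use the default LLM provider"
    max_tool_iterations: int           # tool-loop bound
    default_retrieval: dict            # query-time retrieval settings
    query_tools: tuple = ()
    chunk_tools: tuple = ()
    llm_capabilities: tuple = ()       # optional LLM-callable retrieval/action capabilities
    prompt_fragments: dict = field(default_factory=dict)  # stage name -> fragments


def resolve_preset(bindings, org_id, compartment_id=None):
    """Return the snapshot for the caller.

    Assumption: a compartment-level binding, when present, wins over the
    organization-level binding.
    """
    key = ("compartment", compartment_id)
    if compartment_id is not None and key in bindings:
        return bindings[key]
    return bindings[("organization", org_id)]
```

Because the retrieval settings, tool chain, and answer model all live on the snapshot, swapping the binding is enough to change behavior over the same corpus.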

Query processing (step 1)

When a query arrives at /v1/chat/completions:

  1. Admission control — API-key rate limits, chat concurrency limits, and provider catalog limits protect the server from overload.
  2. Query extraction — The last user message is extracted for retrieval; full conversation history still goes to the answer LLM.
  3. Scope resolution — API key → effective user → allowed connections/scopes (compartments + sensitivity levels).
  4. Query tools — Active query tools can rewrite the query into semantic and keyword variants.
  5. Query embedding — The query text is embedded using the same model family as document chunks.
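The query extraction step can be sketched as below. The function name is hypothetical, and the handling of list-valued `content` follows the OpenAI chat message format rather than anything this page confirms about CoreCube.

```python
def extract_retrieval_query(messages):
    """Return the text of the last user message, or None if there is none.

    The full `messages` list is left untouched so it can still be sent
    to the answer LLM.
    """
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # OpenAI-style content parts: keep only the text parts.
            texts = [
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            ]
            return " ".join(texts)
        return None
    return None
```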

Hybrid search (step 2)

The default retrieval tool controls the dense pool, sparse pool, fusion weights, score floor, freshness floor, HNSW ef_search, reranker model, rerank pool, and final top-K. Two search legs run in parallel and are fused:

Full-text search uses PostgreSQL's websearch_to_tsquery with weighted tsvectors:

  • Weight A — document title (highest signal)
  • Weight B — heading path (e.g., "Deployment > Staging > Prerequisites")
  • Weight C — chunk body content

Vector search uses pgvector's HNSW index for approximate nearest neighbor with cosine distance.
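To make the two legs concrete, here is a sketch of what the underlying SQL might look like, held as Python string constants. The `chunks` table, its column names, the `'english'` text search configuration, the parameter names, and the `ef_search` value of 100 are assumptions; the PostgreSQL functions (`setweight`, `to_tsvector`, `websearch_to_tsquery`, `ts_rank`) and the pgvector `<=>` cosine distance operator are real. Scope filtering is omitted for brevity.

```python
# Hypothetical table: chunks(id, title, heading_path, body, tsv tsvector, embedding vector)

# Build the weighted tsvector: A = title, B = heading path, C = body.
BUILD_TSV_SQL = """
UPDATE chunks
SET tsv =
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(heading_path, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'C')
"""

# Sparse leg: full-text search ranked by ts_rank.
FTS_LEG_SQL = """
SELECT id, ts_rank(tsv, websearch_to_tsquery('english', %(query)s)) AS fts_score
FROM chunks
WHERE tsv @@ websearch_to_tsquery('english', %(query)s)
ORDER BY fts_score DESC
LIMIT %(sparse_pool)s
"""

# Dense leg: approximate nearest neighbors by cosine distance.
# Ordering by the raw distance lets the HNSW index serve the query.
VECTOR_LEG_SQL = """
SELECT id, 1 - (embedding <=> %(query_embedding)s::vector) AS cosine_similarity
FROM chunks
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT %(dense_pool)s
"""

# pgvector's per-transaction HNSW search breadth (value is illustrative).
SET_EF_SEARCH_SQL = "SET LOCAL hnsw.ef_search = 100"
```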

RRF fusion combines both ranked lists. A document that appears in both lists accumulates score from each leg, so it generally ranks above a document found by only one. The default presets use rrf_k = 60, with vector and FTS stream weights stored on the retrieval tool.
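A minimal sketch of weighted reciprocal rank fusion follows. Each list contributes `weight / (rrf_k + rank)` for every document it contains. The function name and the default weights of 1.0 are assumptions; only `rrf_k = 60` comes from the text above.

```python
def rrf_fuse(vector_ids, fts_ids, rrf_k=60, vector_weight=1.0, fts_weight=1.0):
    """Fuse two ranked ID lists with weighted reciprocal rank fusion.

    Returns (chunk_id, fused_score) pairs, best first.
    """
    scores = {}
    for weight, ranked_ids in ((vector_weight, vector_ids), (fts_weight, fts_ids)):
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
# "b" is in both lists, so it collects 1/62 + 1/61 and moves to the top.
```

Because only ranks enter the formula, the raw vector and FTS scores never need to be normalized against each other.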

Ranking and filtering (step 3)

The fused candidate list passes through ranking and filtering:

| Stage | Description |
| --- | --- |
| Freshness decay | Exponential decay based on time since last sync. Configurable half-life. Recently synced content ranks higher. |
| Quality filter | Exclude boilerplate chunks (navigation menus, footers, repeated disclaimers) below a quality score threshold. |
| ACL filter | Enforce scope: remove any chunks from connections outside the query's allowed compartments/sensitivity. |
| Cross-encoder reranker | Score each (query, chunk) pair jointly. Candidate pool, final top-K, enabled flag, and model come from the retrieval tool. Graceful fallback to fusion ranking when reranking is disabled or unavailable. |
| Chunk tools | Optional preset tools can re-score, filter, or expand the candidate list before answer generation. |
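The freshness decay stage can be illustrated with a half-life formula: the factor halves every `half_life_days` since the last sync. This is a sketch under assumptions. The function name and the 30-day default are hypothetical, and how the factor is combined with the fused score is not specified here.

```python
from datetime import datetime, timedelta, timezone


def freshness_factor(last_synced_at, now, half_life_days=30.0):
    """Exponential decay in (0, 1]: 1.0 when just synced, 0.5 after one half-life."""
    age_seconds = max((now - last_synced_at).total_seconds(), 0.0)
    age_days = age_seconds / 86400.0
    return 0.5 ** (age_days / half_life_days)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
one_half_life_ago = now - timedelta(days=30)
factor = freshness_factor(one_half_life_ago, now)  # one half-life old
```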

Hybrid ranking weights

| Signal | Default source |
| --- | --- |
| Vector similarity | Retrieval tool dense stream |
| Full-text relevance | Retrieval tool sparse stream |
| Freshness boost | Retrieval tool + freshness_decay chunk tool |

When scores are equal (within 0.01 tolerance), source trust level breaks the tie: authoritative > reference > volatile.
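The tie-break rule can be sketched with a custom comparator. The dictionary keys and function names are hypothetical, and treating a difference of exactly 0.01 as a tie is an assumption. Note that tolerance-based comparison is not strictly transitive, so a production implementation might bucket scores instead.

```python
from functools import cmp_to_key

TRUST_RANK = {"authoritative": 0, "reference": 1, "volatile": 2}
SCORE_TOLERANCE = 0.01


def _compare(a, b):
    """Order by score descending; near-equal scores fall back to trust level."""
    if abs(a["score"] - b["score"]) <= SCORE_TOLERANCE:
        return TRUST_RANK[a["trust"]] - TRUST_RANK[b["trust"]]
    return -1 if a["score"] > b["score"] else 1


def rank_with_trust_tiebreak(chunks):
    return sorted(chunks, key=cmp_to_key(_compare))


ranked = rank_with_trust_tiebreak([
    {"id": "v", "score": 0.805, "trust": "volatile"},
    {"id": "a", "score": 0.800, "trust": "authoritative"},
    {"id": "r", "score": 0.500, "trust": "reference"},
])
# "a" and "v" are within tolerance, so the authoritative chunk goes first.
```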

Context assembly (step 4)

Context is formatted as:

### [1] Deployment Runbook — Confluence — Engineering
...chunk content...

### [2] Incident Response Guide — Confluence — Engineering
...chunk content...

The numbered headers map directly to citation references in the response.
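A short sketch of a formatter that produces this layout is shown below. The chunk keys (`title`, `source`, `compartment`, `content`) are hypothetical, and reading the second and third header fields as source system and compartment is an assumption based on the example above.

```python
def format_context(chunks):
    """Render chunks as numbered '### [N] ...' blocks separated by blank lines."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        header = (
            f"### [{number}] {chunk['title']} — "
            f"{chunk['source']} — {chunk['compartment']}"
        )
        blocks.append(f"{header}\n{chunk['content']}")
    return "\n\n".join(blocks)


context = format_context([
    {"title": "Deployment Runbook", "source": "Confluence",
     "compartment": "Engineering", "content": "...chunk content..."},
    {"title": "Incident Response Guide", "source": "Confluence",
     "compartment": "Engineering", "content": "...chunk content..."},
])
```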

LLM generation (step 5)

  1. Prompt assembly — Answer-stage fragments + active tool prompts + invariant footer + formatted context + conversation history.
  2. Answer model selection — The active preset's answer model is used when set; otherwise the default LLM provider is selected.
  3. Tool-call loop — Non-streaming requests can let the answer LLM call retrieval or action capabilities attached to the preset.
  4. Streaming — SSE chunks in OpenAI format are forwarded to the client. Extended streaming adds a cube.metadata frame before [DONE].
  5. Citation mapping — [N] references in the response text map to the numbered context chunks.
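Citation mapping can be sketched as a regular-expression pass over the answer text. The function name is hypothetical, and silently ignoring out-of-range numbers is an assumption about how unmatched references are handled.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def map_citations(answer_text, context_chunks):
    """Map each cited number [N] to the N-th context chunk (1-based).

    Numbers with no matching chunk are skipped; repeated citations are
    recorded once, in order of first appearance.
    """
    cited = {}
    for match in CITATION_PATTERN.finditer(answer_text):
        number = int(match.group(1))
        if 1 <= number <= len(context_chunks):
            cited.setdefault(number, context_chunks[number - 1])
    return cited


citations = map_citations(
    "Deploy via the pipeline [1]. Escalate per [2] and [1]. Ignore [7].",
    ["runbook chunk", "incident guide chunk"],
)
```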

Ingestion pipeline

The ingestion pipeline moves documents from source systems into the evidence layer.

Ingestion is idempotent. A changed document has its chunks replaced through an atomic chunk replacement path. An unchanged external document can still have missing chunk state repaired. Cross-source content deduplication never reuses another connection's storage key.

Chunking strategies

| Content type | Strategy | Behavior |
| --- | --- | --- |
| Markdown / docs | Heading-aware | Split at H1/H2/H3 boundaries, preserve section hierarchy |
| Code files | Code-aware | Split at function/class boundaries, keep imports with first chunk |
| HTML pages | Heading-aware | Extract to markdown first, then heading-based chunking |
| PDF | Paragraph-based | Paragraph boundaries (PDFs lack reliable heading structure) |
| Plain text | Fixed-size | Fixed-size with configurable overlap |
| JSON / structured | Record-based | Each top-level record or array item becomes a chunk |
| Tables | Intact | Tables kept whole when possible, headers repeated when split |

Default chunk size: 512 tokens target, 1024 token maximum. Overlap: 50 tokens.
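As an illustration of the fixed-size strategy with these defaults, the sketch below slides a 512-token window forward by 462 tokens so that consecutive chunks share 50 tokens. The function name is hypothetical, the input is assumed to be an already tokenized list, and the 1024-token maximum is not modeled because its interaction with the target size is not described here.

```python
def chunk_fixed_size(tokens, target_size=512, overlap=50):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens."""
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be non-negative and smaller than target_size")
    step = target_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target_size])
        if start + target_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


chunks = chunk_fixed_size(list(range(1000)))
# Windows start at tokens 0, 462, and 924; the last one is shorter.
```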

Query Explorer

Use the Query Explorer in the Admin Console to inspect the full pipeline for any query:

  • Score breakdown per chunk (vector, FTS, freshness, combined)
  • Active preset, retrieval tool, query tools, and chunk tools
  • Side-by-side search mode comparison (vector-only vs FTS-only vs hybrid)
  • Which connections were included or excluded by scope
  • Context assembly trace (which chunks selected, filtered, deduplicated, budget-truncated)
  • Answerability assessment with confidence level
