Skip to main content

Chunking

Chunking is the input side of embedding. Every chunk the splitter emits becomes one row in chunk_embeddings and one candidate during retrieval. The chunk you produce at ingest time is the unit you can recall — the splitter therefore decides what the retriever can and cannot find.

The classic failure modes:

  • Chunks too small → meaning fragments across rows; the retriever returns half a thought.
  • Chunks too large → similarity signal dilutes against unrelated prose; the right chunk is buried in noise.
  • Chunks split mid-sentence → key phrases never appear intact in any single chunk; the retriever misses both halves.

Every field on this page is a lever against one of those failures.

Changes apply to new ingests only

Chunk fields take effect on newly ingested content. Existing chunks are not silently rewritten — CoreCube never re-chunks behind your back. Re-process a connection or re-upload a document to apply the new shape.


Strategy

How CoreCube chooses where one chunk ends and the next begins.

StrategyBoundary signalBest for
headingMarkdown headings (#, ##, ###, ...)Confluence pages, README files, runbooks, structured PDFs. Falls back to paragraphs when no headings exist — safe default.
paragraphBlank-line paragraph breaksLong-form articles, policy documents, transcribed prose, legal text.
fixedRolling window of chunking_chunk_size tokensCode, server logs, plain-text dumps without reliable structure.

Examples

A wiki page with this shape:

## Deployment

We deploy via Helm. ...

### Staging

The staging cluster is in eu-west-1. ...

### Production

Production runs in us-east-2. ...
  • heading produces three chunks: ## Deployment body, ### Staging body, ### Production body. Each chunk is a self-contained section; section titles can optionally be embedded as context (see Late chunking).
  • paragraph produces one chunk per paragraph, ignoring the heading hierarchy. Headings end up bundled with whichever paragraph follows.
  • fixed ignores the structure entirely — boundaries fall wherever the token counter hits the target size, regardless of mid-sentence or mid-section.

For a server log:

2025-01-15 09:23:01 INFO request received
2025-01-15 09:23:01 DEBUG payload size 4096 bytes
2025-01-15 09:23:02 ERROR upstream timeout

There are no headings or paragraphs, so heading and paragraph collapse to one giant chunk. fixed is the right call — slice into uniform windows so the retriever can find specific log spans.


Chunk size

The target chunk length, in tokens, that the splitter aims for.

A token is roughly 4 characters or 0.75 words in English — 512 tokens is a paragraph or two of dense prose, ~2000 characters.

RangeBehavior
Small (128–256)Sharp recall — the chunk is small enough that a similarity match means the topic is centrally present. But meaning often splits across adjacent chunks.
Medium (384–768, default 512)Balanced — enough room for a paragraph or two to stand on its own as one coherent unit.
Large (1024–2048)More context per hit, fewer chunks total, less retrieval granularity. Useful for dense reference material where most queries want the whole section.

The bounds enforced on this field follow the Min/Max chunk size values in Advanced — the form prevents you from setting the target outside the allowed range.

Worked example

A 4,000-token document chunked at chunk_size = 512 with chunk_overlap = 50 produces approximately:

ceil(4000 / (512 - 50)) = 9 chunks

Total tokens stored: 9 × 512 = 4,608 (~15% overhead from overlap).


Chunk overlap

Tokens carried over from the end of one chunk into the start of the next.

Overlap exists to solve the boundary problem: a sentence or fact split across two chunks would otherwise appear in neither, because each half is too short to similarity-match the full query. With overlap, every span of length chunk_overlap appears in at least one chunk intact.

OverlapEffect
0No overlap — minimum storage cost. Any phrase split at a chunk boundary is lost to retrieval.
10–20% of chunk size (recommended)Standard practice. For a 512-token chunk, an overlap of 50–100 captures most boundary phrases.
> 30% of chunk sizeExcessive — index grows substantially without commensurate recall gain. Almost always a sign the chunk size is wrong.

Constraint: chunking_chunk_overlap < chunking_chunk_size is enforced at validation. A chunk cannot overlap more than itself.

Worked example

chunk_size = 512, chunk_overlap = 50 on a 1,500-token document:

chunk 1: tokens 0 – 511
chunk 2: tokens 462 – 973 (50 overlap with chunk 1)
chunk 3: tokens 924 – 1435 (50 overlap with chunk 2)
chunk 4: tokens 1386 – 1499 (short tail; merged into chunk 3 if below min size)

Min chunk size

The smallest chunk the splitter will emit. Chunks shorter than this are merged into the next one.

Without a floor, the splitter can produce tiny trailing chunks — a lone heading, a single bullet, the end of a section — that waste an embedding row and rarely contribute useful signal. Most retrieval rankings will surface them only by accident.

Default 64 tokens is roughly two short sentences — small enough that legitimate short sections survive, large enough that lone headings get folded into their bodies.


Max chunk size

The hard ceiling on chunk length. Chunks longer than this are forcibly split, even mid-paragraph.

This is a safety valve, not a tuning knob. Pathological inputs — a 50,000-token blob without paragraph breaks, a wall of legal boilerplate, an OCR'd page with no structure — would otherwise emit one giant chunk that exceeds the embedding model's context window and gets silently truncated by the provider.

The ceiling guarantees every chunk fits inside the embedding model's input limit with margin to spare. Default 1024 tokens is conservative for current models (most accept 8k+ input).

Constraint: min_chunk_size < chunk_size ≤ max_chunk_size. The form blocks saves that violate the ordering.


Late chunking

Embed the whole document with a long-context encoder, then derive each chunk's vector from its token span. Coming soon — value is stored but ignored by the runtime today.

Standard chunking embeds each chunk in isolation: the model sees only the chunk's own tokens. Late chunking flips the order — the model sees the whole document first, builds contextualized token representations across the full text, and then each chunk's vector is the mean of its own token vectors from that pass.

The result: a chunk's vector reflects its surrounding context. A sentence like "We discontinued it in 2023" inherits semantic context about what was discontinued from the surrounding paragraphs. Without late chunking, that sentence is a useless vector — it could refer to anything.

When it matters: documents where individual chunks rely heavily on surrounding context to be meaningful — meeting transcripts, legal documents, code with cross-references.

Cost: requires a long-context encoder (jina-embeddings-v3, nomic-embed-v2-text, etc.) and one full-document encoder pass per ingest. Roughly 2–5× more compute than standard chunking; storage cost unchanged.


Contextual chunking

Prepend an LLM-generated summary of where the chunk sits in the document, then embed. Coming soon — value is stored but ignored by the runtime today.

Anthropic's contextual retrieval technique: before embedding each chunk, ask an LLM to write a one-paragraph context summary (e.g. "This chunk is from the Q3 2024 earnings call; the speaker is the CFO discussing data-center revenue.") and prepend it to the chunk text.

The chunk's vector then reflects both what the chunk says and where it lives. A chunk about "data-center revenue" embedded with that prefix matches a query about Q3 earnings data-center segment even when neither phrase appears in the chunk body.

When it matters: chunks that lose meaning in isolation — a table cell, a code snippet, a specification clause that references "the above" or "the system".

Cost: one LLM call per chunk at ingest time. For a corpus with 1M chunks at $0.001 per call, that is $1,000 of one-off compute — usually worth it for high-value enterprise corpora, often not for log dumps.

Late chunking and contextual chunking are not redundant

Late chunking gives the chunk an embedding informed by surrounding token activations. Contextual chunking gives it an embedding informed by an explicit natural-language summary. Different mechanisms, different trade-offs. Late chunking is cheaper at scale; contextual is more interpretable.


Cross-field rules

The validator enforces:

chunking_min_chunk_size < chunking_chunk_size ≤ chunking_max_chunk_size
chunking_chunk_overlap < chunking_chunk_size

The form surfaces these inline next to the offending field; the API rejects violating saves outright at the validation boundary.


Setting reference

Setting keyTypeDefaultRange
chunking_strategy'heading' | 'paragraph' | 'fixed''heading'
chunking_chunk_sizeinteger (tokens)512642048
chunking_chunk_overlapinteger (tokens)500256
chunking_min_chunk_sizeinteger (tokens)6416256
chunking_max_chunk_sizeinteger (tokens)10242564096
chunking_late_enabledboolean (Coming soon)false
chunking_contextual_enabledboolean (Coming soon)false

We use cookies for analytics to improve our website. More information in our Privacy Policy.