
Chat Completions

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint with automatic knowledge retrieval and source citations. Any OpenAI client can connect without code changes beyond the base URL and API key.

Request

POST /v1/chat/completions
Authorization: Bearer cc_YOUR_API_KEY
Content-Type: application/json

{
  "model": "claude-sonnet-4-5",
  "messages": [
    {
      "role": "user",
      "content": "What do our deployment runbooks say about rollbacks?"
    }
  ],
  "stream": false
}

Supported fields

CoreCube accepts a subset of the standard OpenAI chat completions fields. Anything outside that subset (top_p, n, tools, multimodal content arrays, role: "tool", and others) is silently ignored rather than rejected. Of the sampling parameters, only temperature and max_tokens are forwarded to the provider. Message content must be a string, and role is limited to system, user, or assistant. The accepted fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | default provider model | Model name forwarded to the selected provider. |
| messages | array | required | Conversation history in OpenAI format. |
| stream | boolean | false | Enable Server-Sent Events streaming. |
| temperature | number | provider default | Sampling temperature forwarded to the provider. |
| max_tokens | number | provider default | Maximum output tokens. |
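
Because ignored fields produce no error, a client-side check can surface them before a request is sent. The following is a minimal sketch built only from the rules stated above; the function name `lint_request` and the warning wording are illustrative and not part of CoreCube.

```python
from typing import Any

# Fields this page lists as accepted; anything else is silently ignored.
ACCEPTED_FIELDS = {
    "model", "messages", "stream", "temperature", "max_tokens",
    "cube_extended", "connectionIds",
}
ALLOWED_ROLES = {"system", "user", "assistant"}


def lint_request(payload: dict[str, Any]) -> list[str]:
    """Return warnings for a chat completions payload (hypothetical helper)."""
    warnings = [
        f"field '{name}' will be ignored"
        for name in sorted(set(payload) - ACCEPTED_FIELDS)
    ]
    messages = payload.get("messages")
    if not isinstance(messages, list):
        warnings.append("'messages' is required and must be a list")
        return warnings
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            warnings.append(f"messages[{i}]: must be an object")
            continue
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            warnings.append(f"messages[{i}]: unsupported role {role!r}")
        if not isinstance(message.get("content"), str):
            warnings.append(f"messages[{i}]: content must be a string")
    return warnings
```

Running this in tests or in a development build is one way to catch an OpenAI client that sends tools or multimodal content and would otherwise get an answer that quietly omits them.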

CoreCube-specific fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| cube_extended | boolean | false | Enable extended mode: adds structured cube metadata to the response. |
| connectionIds | string[] | omitted | Narrow retrieval to specific connection IDs. Scope enforcement still applies. |

Extended mode can also be enabled with a header instead of the cube_extended field: X-Cube-Extended: true.
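
As an illustration of how the CoreCube-specific fields sit alongside the standard ones, here is a small sketch that assembles a request body. The helper name `build_chat_body` and the connection ID in the usage note are made up for the example.

```python
from typing import Optional


def build_chat_body(question: str, extended: bool = False,
                    connection_ids: Optional[list[str]] = None) -> dict:
    """Assemble a chat completions body (hypothetical helper)."""
    body: dict = {
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }
    if extended:
        # Body-field equivalent of the X-Cube-Extended: true header.
        body["cube_extended"] = True
    if connection_ids:
        # Omitted entirely when no narrowing is requested.
        body["connectionIds"] = list(connection_ids)
    return body
```

For example, `build_chat_body("Who owns auth?", extended=True, connection_ids=["conn_example"])` yields a body with both extra fields set. When using the OpenAI Python SDK, the same fields can presumably be passed through the SDK's `extra_body` request option, or the header through `extra_headers`, since the typed `create()` signature does not know about them.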

Response (strict mode)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744387200,
  "model": "claude-sonnet-4-5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "According to the deployment runbooks, rollbacks should follow this process:\n\n1. Immediately stop the deployment if health checks fail [1]\n2. Run `make rollback VERSION=<previous>` to revert the container image [1]\n3. Notify the on-call team via PagerDuty [2]\n4. Document the incident in the postmortem template [2]\n\n---\n**Sources:**\n[1] Deployment Runbook — https://confluence.company.com/pages/1234\n[2] Incident Response Guide — https://confluence.company.com/pages/5678"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1842,
    "completion_tokens": 127,
    "total_tokens": 1969
  }
}

Response (extended mode)

With X-Cube-Extended: true or "cube_extended": true:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [...],
  "usage": {...},
  "cube": {
    "citations": [
      {
        "index": 1,
        "title": "Deployment Runbook",
        "url": "https://confluence.company.com/pages/1234",
        "connectionName": "Confluence — Engineering",
        "sourceLabel": "Confluence — Engineering"
      },
      {
        "index": 2,
        "title": "Incident Response Guide",
        "url": "https://confluence.company.com/pages/5678",
        "connectionName": "Confluence — Engineering",
        "sourceLabel": "Confluence — Engineering"
      }
    ],
    "attachments": [],
    "searchLatencyMs": 142,
    "llmLatencyMs": 1840,
    "chunksRetrieved": 8,
    "filesRetrieved": 0,
    "degraded": false,
    "emptyReason": null,
    "degradations": []
  }
}
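
The cube block lets a client render citations itself instead of parsing the Sources footer out of the message text. A minimal sketch, assuming the parsed response has the shape shown above; `format_citations` is an illustrative name.

```python
def format_citations(response: dict) -> list[str]:
    """Turn cube.citations into display lines, ordered by citation index."""
    cube = response.get("cube") or {}  # absent in strict mode
    citations = sorted(cube.get("citations", []), key=lambda c: c["index"])
    return [
        f"[{c['index']}] {c['title']} ({c['sourceLabel']}): {c['url']}"
        for c in citations
    ]
```

A strict-mode response has no cube key, so the function returns an empty list rather than failing.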

Streaming

{ "stream": true }

Responses are streamed as Server-Sent Events in the standard OpenAI format:

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"According"},"index":0}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" to"},"index":0}]}
...
data: [DONE]

In extended mode, a cube.metadata frame is sent before [DONE]:

data: {"object":"cube.metadata","cube":{"citations":[...],"searchLatencyMs":142,"chunksRetrieved":8}}
data: [DONE]

Streaming currently does not support presets that expose LLM-callable retrieval or action tools. Use non-streaming chat for those presets.
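
The frames above can be consumed with a few lines of parsing. This is a sketch under the assumption that the stream arrives as decoded text lines in exactly the shapes shown; `collect_stream` is a hypothetical helper, and a production client would more likely rely on an SSE-aware library.

```python
import json
from typing import Iterable, Optional


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Accumulate streamed content and the optional cube.metadata frame."""
    parts: list[str] = []
    metadata: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        frame = json.loads(data)
        if frame.get("object") == "cube.metadata":
            metadata = frame.get("cube")  # only sent in extended mode
            continue
        for choice in frame.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), metadata
```

The function returns the concatenated answer text and, when extended mode is on, the contents of the metadata frame; in strict mode the second value stays `None`.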

Rate limits and concurrency

CoreCube can return 429 before provider work starts:

| Cause | Response |
| --- | --- |
| API-key request-per-minute limit exceeded | 429 with the API-key rate-limit error. |
| Chat concurrency limit exceeded | 429, code CONCURRENCY_LIMIT_EXCEEDED, Retry-After: 1. |
| Provider RPM/TPM catalog limit exceeded | 429, code provider_rate_limited, Retry-After header. |

Default chat concurrency is 100 in-flight chat requests per server instance, 100 per API key, 8 per effective user, and 8 per source IP for public keys. See Environment Variables.
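
A client can respect these responses by waiting for the advertised Retry-After value and falling back to exponential backoff when the header is missing. The sketch below is one possible policy, not CoreCube's recommendation; `retry_delay` and its `base` and `cap` defaults are assumptions.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a number of seconds (e.g. an HTTP date); use backoff
    # No usable header: exponential backoff, capped.
    return min(cap, base * (2 ** attempt))
```

With the defaults, a concurrency rejection carrying Retry-After: 1 waits one second, while a 429 without the header waits 0.5 s, 1 s, 2 s and so on up to 30 s.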

Insufficient evidence

When retrieval finds no relevant context in strict answerability mode, the response is:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I don't have enough information in the knowledge base to answer this question. The available sources don't contain relevant information about this topic."
      },
      "finish_reason": "stop"
    }
  ]
}

In strict answerability mode, such a query is not forwarded to the LLM; only the knowledge base is searched.
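
Matching on the refusal text is brittle, so a client that needs to detect this case programmatically may prefer the extended-mode metadata. The sketch below rests on an assumption this page does not confirm: that an insufficient-evidence response carries a non-null emptyReason or an empty citations list in its cube block. `answer_is_grounded` is an illustrative name.

```python
def answer_is_grounded(response: dict) -> bool:
    """Heuristic: True when the answer appears to be backed by retrieved sources.

    ASSUMPTION: ungrounded answers have cube.emptyReason set or no citations.
    """
    cube = response.get("cube")
    if cube is None:
        raise ValueError("no 'cube' block; request extended mode first")
    has_citations = len(cube.get("citations", [])) > 0
    return cube.get("emptyReason") is None and has_citations
```

If the assumption does not hold for a given deployment, comparing the message content against the known refusal text remains the fallback.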

Examples

cURL

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Who owns the authentication service?"}],
    "stream": false
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="cc_YOUR_API_KEY",
    base_url="https://corecube.your-domain.com/v1",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "What is our data retention policy?"}],
)

print(response.choices[0].message.content)

JavaScript (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'cc_YOUR_API_KEY',
  baseURL: 'https://corecube.your-domain.com/v1',
});

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  messages: [{ role: 'user', content: 'What are our deployment procedures?' }],
});

console.log(response.choices[0].message.content);

Service key (per-user permissions)

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_SERVICE_API_KEY" \
  -H "X-Cube-User: sarah@company.com" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Show me the Q4 budget projections"}]}'
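
The same request can be built in Python without the OpenAI SDK. This sketch uses only the standard library and constructs the request without sending it; `service_request` is a hypothetical helper and `BASE_URL` is the placeholder host from the examples on this page.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://corecube.your-domain.com/v1"  # placeholder host


def service_request(api_key: str, question: str,
                    user: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a chat request, optionally on behalf of a user."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if user is not None:
        # Identity whose permissions the service key should apply.
        headers["X-Cube-User"] = user
    body = json.dumps({"messages": [{"role": "user", "content": question}]})
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes it easy to unit-test that the user header is set for every call made with a service key.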

Public key (anonymous access)

Public keys require no user identity. Any X-Cube-User header is ignored — the bound scope always applies.

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_PUBLIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How do I install CoreCube?"}]}'

Browser example for an embedded chat widget:

const response = await fetch('https://corecube.your-domain.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer cc_PUBLIC_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    messages: [{ role: 'user', content: userQuestion }],
    stream: true,
  }),
});

Configure CORS on your reverse proxy to allow requests from the widget's origin. CoreCube does not emit per-key CORS headers for public keys and expects the proxy to handle preflight requests.

Model listing

GET /v1/models
Authorization: Bearer cc_YOUR_API_KEY

Returns a single model entry whose id is the instance's app name (default CoreCube), in OpenAI format. Clients like OpenWebUI use this for automatic model discovery — they simply see one model.
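
For clients that do their own discovery, the listing can be reduced to model IDs as follows. The sketch assumes the usual OpenAI list shape, an object with a data array of entries that each carry an id; `model_ids` is an illustrative name.

```python
def model_ids(listing: dict) -> list[str]:
    """Extract model IDs from an OpenAI-format /v1/models response."""
    return [entry["id"] for entry in listing.get("data", [])]
```

Against a default instance this would be expected to return a single-element list containing the app name.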

Other /v1 endpoints

Two additional resources, tools and documents, are exposed under /v1. They are primarily internal to CoreCube's own clients, but are reachable with a valid API key.

| Endpoint | Purpose |
| --- | --- |
| GET /v1/tools | List the tools available to the calling key. |
| GET /v1/tools/:slug | Retrieve a single tool by slug. |
| POST /v1/tools/:slug/invoke | Invoke a tool. Gated by per-tool authorization; unauthorized calls are denied with 403 tool_permission_denied. |
| GET /v1/documents/:id/file | Fetch a cited source file via a signed citation token. Extended-mode cube.attachments[].downloadUrl values resolve here. |
