Chat Completions

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint with automatic knowledge retrieval and source citations. OpenAI clients connect without code changes once they are pointed at the CoreCube base URL and API key.

Request

POST /v1/chat/completions
Authorization: Bearer cc_YOUR_API_KEY
Content-Type: application/json

{
  "model": "claude-sonnet-4-5",
  "messages": [
    {
      "role": "user",
      "content": "What do our deployment runbooks say about rollbacks?"
    }
  ],
  "stream": false
}

Supported fields

CoreCube accepts the standard OpenAI chat completions fields listed below. Anything else (top_p, n, tools, multimodal content arrays, role: "tool", and others) is silently ignored. Message content must be a string, and role must be system, user, or assistant. Of the generation parameters, only temperature and max_tokens are forwarded to the provider. The accepted fields:

Field	Type	Default	Description
`model`	string	default provider model	Model name forwarded to the selected provider.
`messages`	array	required	Conversation history in OpenAI format
`stream`	boolean	`false`	Enable Server-Sent Events streaming
`temperature`	number	provider default	Sampling temperature forwarded to the provider
`max_tokens`	number	provider default	Maximum output tokens

CoreCube-specific fields

Field	Type	Default	Description
`cube_extended`	boolean	`false`	Enable extended mode — adds structured `cube` metadata to the response.
`connectionIds`	string[]	omitted	Narrow retrieval to specific connection IDs. Scope enforcement still applies.

Extended mode can also be enabled with a request header instead of the cube_extended body field: X-Cube-Extended: true.
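
As a minimal sketch, the CoreCube-specific fields can be added to an ordinary request body as shown here. The helper name build_chat_payload and the connection ID "conn_123" are hypothetical; only the field names come from the tables above.

```python
import json


def build_chat_payload(question, connection_ids=None, extended=False):
    """Build a chat completions body with optional CoreCube-specific fields."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }
    if extended:
        # Body-field equivalent of the X-Cube-Extended: true header.
        payload["cube_extended"] = True
    if connection_ids:
        # Narrows retrieval to these connections; scope enforcement still applies.
        payload["connectionIds"] = list(connection_ids)
    return payload


# "conn_123" is a placeholder, not a real connection ID.
body = build_chat_payload("Who owns the authentication service?", ["conn_123"], extended=True)
print(json.dumps(body, indent=2))
```

Both fields are omitted when unused, so the same helper produces a plain OpenAI-style body by default.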

Response (strict mode)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744387200,
  "model": "claude-sonnet-4-5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "According to the deployment runbooks, rollbacks should follow this process:\n\n1. Immediately stop the deployment if health checks fail [1]\n2. Run `make rollback VERSION=<previous>` to revert the container image [1]\n3. Notify the on-call team via PagerDuty [2]\n4. Document the incident in the postmortem template [2]\n\n---\n**Sources:**\n[1] Deployment Runbook — https://confluence.company.com/pages/1234\n[2] Incident Response Guide — https://confluence.company.com/pages/5678"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1842,
    "completion_tokens": 127,
    "total_tokens": 1969
  }
}

Response (extended mode)

With X-Cube-Extended: true or "cube_extended": true:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [...],
  "usage": {...},
  "cube": {
    "citations": [
      {
        "index": 1,
        "title": "Deployment Runbook",
        "url": "https://confluence.company.com/pages/1234",
        "connectionName": "Confluence — Engineering",
        "sourceLabel": "Confluence — Engineering"
      },
      {
        "index": 2,
        "title": "Incident Response Guide",
        "url": "https://confluence.company.com/pages/5678",
        "connectionName": "Confluence — Engineering",
        "sourceLabel": "Confluence — Engineering"
      }
    ],
    "attachments": [],
    "searchLatencyMs": 142,
    "llmLatencyMs": 1840,
    "chunksRetrieved": 8,
    "filesRetrieved": 0,
    "degraded": false,
    "emptyReason": null,
    "degradations": []
  }
}
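
A client might turn the cube.citations array into a reference list along these lines. This is a sketch that assumes the response has already been parsed into a dict; format_citations is a hypothetical helper, and the sample is trimmed from the response above.

```python
def format_citations(response):
    """Return one '[index] title: url' line per citation in an extended-mode response."""
    cube = response.get("cube") or {}  # the cube object is absent in strict mode
    return [
        f"[{c['index']}] {c['title']}: {c['url']}"
        for c in cube.get("citations", [])
    ]


sample = {
    "cube": {
        "citations": [
            {
                "index": 1,
                "title": "Deployment Runbook",
                "url": "https://confluence.company.com/pages/1234",
            }
        ]
    }
}

for line in format_citations(sample):
    print(line)  # [1] Deployment Runbook: https://confluence.company.com/pages/1234
```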

Streaming

{ "stream": true }

Responses are streamed as Server-Sent Events in the standard OpenAI format:

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"According"},"index":0}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" to"},"index":0}]}
...
data: [DONE]

In extended mode, a cube.metadata frame is sent before [DONE]:

data: {"object":"cube.metadata","cube":{"citations":[...],"searchLatencyMs":142,"chunksRetrieved":8}}
data: [DONE]
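
The frames above can be consumed with a small parser such as the following sketch. It assumes the stream has already been split into text lines; parse_sse is a hypothetical helper name, and the frame shapes are taken from the examples above.

```python
import json


def parse_sse(lines):
    """Collect assistant text and the optional cube.metadata frame from SSE lines."""
    text_parts = []
    metadata = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        frame = json.loads(data)
        if frame.get("object") == "cube.metadata":
            metadata = frame.get("cube")  # only sent in extended mode
            continue
        for choice in frame.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), metadata
```

The function returns the concatenated answer text and, in extended mode, the cube metadata dict; in strict mode the second value is None.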

Streaming currently does not support presets that expose LLM-callable retrieval or action tools. Use non-streaming chat for those presets.

Rate limits and concurrency

CoreCube can return 429 before provider work starts:

Cause	Response
API-key request-per-minute limit exceeded	`429` with the API-key rate-limit error.
Chat concurrency limit exceeded	`429`, code `CONCURRENCY_LIMIT_EXCEEDED`, `Retry-After: 1`.
Provider RPM/TPM catalog limit exceeded	`429`, code `provider_rate_limited`, `Retry-After` header.

Default chat concurrency is 100 in-flight chat requests per server instance, 100 per API key, 8 per effective user, and 8 per source IP for public keys. See Environment Variables.
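
One way a client could react to these 429 responses is to honor Retry-After when it is present and fall back to exponential backoff otherwise. The sketch below is an assumption about reasonable client behavior, not something CoreCube prescribes; retry_delay_seconds and the 30-second cap are hypothetical, and headers is assumed to be a plain dict keyed by the header name as written.

```python
import random


def retry_delay_seconds(status, headers, attempt, max_delay=30.0):
    """Return how long to wait before retrying, or None if no retry is warranted."""
    if status != 429:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), max_delay)
        except ValueError:
            pass  # not a number of seconds; fall through to backoff
    # No usable header: exponential backoff with jitter, capped at max_delay.
    return min(max_delay, (2 ** attempt) * (0.5 + random.random() / 2))
```

With this helper, a concurrency-limit response carrying Retry-After: 1 yields a one-second wait, while a 429 without the header waits progressively longer on each attempt.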

Insufficient evidence

When retrieval finds no relevant context (strict answerability mode):

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I don't have enough information in the knowledge base to answer this question. The available sources don't contain relevant information about this topic."
      },
      "finish_reason": "stop"
    }
  ]
}

In strict answerability mode, the query is not forwarded to the LLM in this case; only the knowledge base is searched.

Examples

cURL

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Who owns the authentication service?"}],
    "stream": false
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="cc_YOUR_API_KEY",
    base_url="https://corecube.your-domain.com/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "What is our data retention policy?"}]
)

print(response.choices[0].message.content)

JavaScript (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'cc_YOUR_API_KEY',
  baseURL: 'https://corecube.your-domain.com/v1',
});

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  messages: [{ role: 'user', content: 'What are our deployment procedures?' }],
});

console.log(response.choices[0].message.content);

Service key (per-user permissions)

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_SERVICE_API_KEY" \
  -H "X-Cube-User: sarah@company.com" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Show me the Q4 budget projections"}]}'
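
The same service-key request can be sketched in Python with only the standard library. The function names build_service_request and ask_as_user are hypothetical, and the base URL is the placeholder used throughout this page.

```python
import json
import urllib.request


def build_service_request(base_url, service_key, user_email, question):
    """Build a chat completions request that runs with one user's permissions."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": question}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {service_key}",
            "X-Cube-User": user_email,  # identity used for per-user permissions
            "Content-Type": "application/json",
        },
    )


def ask_as_user(base_url, service_key, user_email, question):
    """Send the request and return the assistant's reply text."""
    req = build_service_request(base_url, service_key, user_email, question)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling ask_as_user("https://corecube.your-domain.com", "cc_SERVICE_API_KEY", "sarah@company.com", "Show me the Q4 budget projections") would issue the same request as the cURL command above.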

Public key (anonymous access)

Public keys require no user identity. Any X-Cube-User header is ignored — the bound scope always applies.

curl https://corecube.your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer cc_PUBLIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How do I install CoreCube?"}]}'

Browser example for an embedded chat widget:

const response = await fetch('https://corecube.your-domain.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer cc_PUBLIC_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    messages: [{ role: 'user', content: userQuestion }],
    stream: true,
  }),
});

Configure CORS on your reverse proxy to allow requests from the widget's origin. Public keys do not write per-key CORS headers — CoreCube expects the proxy to handle preflight.

Model listing

GET /v1/models
Authorization: Bearer cc_YOUR_API_KEY

Returns a single model entry whose id is the instance's app name (default CoreCube), in OpenAI format. Clients like OpenWebUI use this for automatic model discovery — they simply see one model.
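
Because the listing uses the OpenAI list format, a client can read the model ID from the data array. The sketch below assumes the response has already been parsed into a dict; list_model_ids is a hypothetical helper, and the sample shows only the fields it reads.

```python
def list_model_ids(models_response):
    """Return the model IDs from an OpenAI-format /v1/models response."""
    return [model["id"] for model in models_response.get("data", [])]


# Sample shaped like an OpenAI model list, using the default app name.
sample_models = {"object": "list", "data": [{"id": "CoreCube", "object": "model"}]}
print(list_model_ids(sample_models))  # ['CoreCube']
```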

Other `/v1` endpoints

Two additional resources are exposed under /v1. They are primarily internal to CoreCube's own clients, but are reachable with a valid API key.

Endpoint	Purpose
`GET /v1/tools`	List the tools available to the calling key.
`GET /v1/tools/:slug`	Retrieve a single tool by slug.
`POST /v1/tools/:slug/invoke`	Invoke a tool. Gated by per-tool authorization; unauthorized calls are denied with `403 tool_permission_denied`.
`GET /v1/documents/:id/file`	Fetch a cited source file via a signed citation token. Extended-mode `cube.attachments[].downloadUrl` values resolve here.
