LLM Providers

CoreCube routes answer generation through configured LLM models. A preset can pin its own answer model; otherwise CoreCube uses the active default model.

Supported providers

Provider	Type	Endpoint
Anthropic Claude	Managed	Anthropic API
OpenAI	Managed	OpenAI API
Google (Gemini)	Managed	Google Gemini API
Mistral	Managed	Mistral API
Custom / Self-hosted	OpenAI-compatible	Any `/v1/chat/completions` endpoint (Ollama, vLLM, LM Studio, etc.)

Adding a provider

Navigate to Admin Console → LLM Providers
Click New Provider
Select the provider type
Enter the API key and select a model
Click Test to verify connectivity
Save

For custom providers (Ollama, vLLM, etc.), enter the base URL of your endpoint:

Provider	Base URL example
Ollama (local)	`http://localhost:11434/v1`
Ollama (Docker)	`http://ollama:11434/v1`
vLLM	`http://vllm:8000/v1`
LM Studio	`http://localhost:1234/v1`

Routing

Default provider

One active provider is designated as the default. If a preset does not specify an answer model, chat requests use that default.

Preset answer model

Each preset can select an answer LLM in Configuration → Presets → open a preset → Pipeline → Answer LLM. That model overrides the global default for requests using the preset.

Provider rate limits

The model catalog stores conservative request-per-minute and token-per-minute limits for cloud models. When CC_PROVIDER_RATE_LIMITS_ENABLED=true (default), CoreCube checks those limits before calling external chat, reranking, embedding, and OCR providers. Exceeding a catalog limit returns a typed 429 instead of sending a request that the provider is likely to reject.

Fallback routing

Automatic fallback chains and rule-based provider routing are not part of the current runtime. Use separate presets when you need different answer models for different compartments or use cases.

Privacy routing

Assign sensitive compartments to a preset that uses a local-only answer model (Ollama, vLLM) to prevent organizational knowledge from reaching cloud APIs:

Compartment: executive, hr
Sensitivity: confidential, restricted
→ Assign preset with answer model: local-ollama (no external API call)

Provider health monitoring

The Admin Console shows real-time provider health:

Metric	Description
Status	Connected / degraded / offline
Queries	Total queries routed to this provider
Tokens	Input and output tokens consumed
Latency	Average response latency (p50, p95)
Cost	Estimated cost based on token counts
Error rate	Percentage of failed requests

Usage stats

Navigate to Admin Console → LLM Providers to see per-provider usage broken down by time period (today, week, month).

Streaming

CoreCube forwards streaming responses from LLM providers to clients via Server-Sent Events (SSE) in the standard OpenAI format. Streaming is enabled by default and works with any OpenAI-compatible client.

To disable streaming for a specific request, set "stream": false in the request body.

Supported providers​

Adding a provider​

Routing​

Default provider​

Preset answer model​

Provider rate limits​

Privacy routing​

Provider health monitoring​

Usage stats​

Streaming​