LLM Providers
CoreCube routes answer generation through configured LLM models. A preset can pin its own answer model; otherwise CoreCube uses the active default model.
Supported providers
| Provider | Type | Endpoint |
|---|---|---|
| Anthropic Claude | Managed | Anthropic API |
| OpenAI | Managed | OpenAI API |
| Google (Gemini) | Managed | Google Gemini API |
| Mistral | Managed | Mistral API |
| Custom / Self-hosted | OpenAI-compatible | Any /v1/chat/completions endpoint (Ollama, vLLM, LM Studio, etc.) |
Adding a provider
- Navigate to Admin Console → LLM Providers
- Click New Provider
- Select the provider type
- Enter the API key and select a model
- Click Test to verify connectivity
- Save
For custom providers (Ollama, vLLM, etc.), enter the base URL of your endpoint:
| Provider | Base URL example |
|---|---|
| Ollama (local) | http://localhost:11434/v1 |
| Ollama (Docker) | http://ollama:11434/v1 |
| vLLM | http://vllm:8000/v1 |
| LM Studio | http://localhost:1234/v1 |
Routing
Default provider
One active provider is designated as the default. If a preset does not specify an answer model, chat requests use that default.
Preset answer model
Each preset can select an answer LLM in Configuration → Presets → open a preset → Pipeline → Answer LLM. That model overrides the global default for requests using the preset.
Provider rate limits
The model catalog stores conservative request-per-minute and token-per-minute limits for cloud
models. When CC_PROVIDER_RATE_LIMITS_ENABLED=true (default), CoreCube checks those limits before
calling external chat, reranking, embedding, and OCR providers. Exceeding a catalog limit returns a
typed 429 instead of sending a request that the provider is likely to reject.
Automatic fallback chains and rule-based provider routing are not part of the current runtime. Use separate presets when you need different answer models for different compartments or use cases.
Privacy routing
Assign sensitive compartments to a preset that uses a local-only answer model (Ollama, vLLM) to prevent organizational knowledge from reaching cloud APIs:
Compartment: executive, hr
Sensitivity: confidential, restricted
→ Assign preset with answer model: local-ollama (no external API call)
Provider health monitoring
The Admin Console shows real-time provider health:
| Metric | Description |
|---|---|
| Status | Connected / degraded / offline |
| Queries | Total queries routed to this provider |
| Tokens | Input and output tokens consumed |
| Latency | Average response latency (p50, p95) |
| Cost | Estimated cost based on token counts |
| Error rate | Percentage of failed requests |
Usage stats
Navigate to Admin Console → LLM Providers to see per-provider usage broken down by time period (today, week, month).
Streaming
CoreCube forwards streaming responses from LLM providers to clients via Server-Sent Events (SSE) in the standard OpenAI format. Streaming is enabled by default and works with any OpenAI-compatible client.
To disable streaming for a specific request, set "stream": false in the request body.