Skip to main content

LLM Providers

CoreCube routes answer generation through configured LLM models. A preset can pin its own answer model; otherwise CoreCube uses the active default model.

Supported providers

ProviderTypeEndpoint
Anthropic ClaudeManagedAnthropic API
OpenAIManagedOpenAI API
Google (Gemini)ManagedGoogle Gemini API
MistralManagedMistral API
Custom / Self-hostedOpenAI-compatibleAny /v1/chat/completions endpoint (Ollama, vLLM, LM Studio, etc.)

Adding a provider

  1. Navigate to Admin Console → LLM Providers
  2. Click New Provider
  3. Select the provider type
  4. Enter the API key and select a model
  5. Click Test to verify connectivity
  6. Save

For custom providers (Ollama, vLLM, etc.), enter the base URL of your endpoint:

ProviderBase URL example
Ollama (local)http://localhost:11434/v1
Ollama (Docker)http://ollama:11434/v1
vLLMhttp://vllm:8000/v1
LM Studiohttp://localhost:1234/v1

Routing

Default provider

One active provider is designated as the default. If a preset does not specify an answer model, chat requests use that default.

Preset answer model

Each preset can select an answer LLM in Configuration → Presets → open a preset → Pipeline → Answer LLM. That model overrides the global default for requests using the preset.

Provider rate limits

The model catalog stores conservative request-per-minute and token-per-minute limits for cloud models. When CC_PROVIDER_RATE_LIMITS_ENABLED=true (default), CoreCube checks those limits before calling external chat, reranking, embedding, and OCR providers. Exceeding a catalog limit returns a typed 429 instead of sending a request that the provider is likely to reject.

Fallback routing

Automatic fallback chains and rule-based provider routing are not part of the current runtime. Use separate presets when you need different answer models for different compartments or use cases.

Privacy routing

Assign sensitive compartments to a preset that uses a local-only answer model (Ollama, vLLM) to prevent organizational knowledge from reaching cloud APIs:

Compartment: executive, hr
Sensitivity: confidential, restricted
→ Assign preset with answer model: local-ollama (no external API call)

Provider health monitoring

The Admin Console shows real-time provider health:

MetricDescription
StatusConnected / degraded / offline
QueriesTotal queries routed to this provider
TokensInput and output tokens consumed
LatencyAverage response latency (p50, p95)
CostEstimated cost based on token counts
Error ratePercentage of failed requests

Usage stats

Navigate to Admin Console → LLM Providers to see per-provider usage broken down by time period (today, week, month).

Streaming

CoreCube forwards streaming responses from LLM providers to clients via Server-Sent Events (SSE) in the standard OpenAI format. Streaming is enabled by default and works with any OpenAI-compatible client.

To disable streaming for a specific request, set "stream": false in the request body.

We use cookies for analytics to improve our website. More information in our Privacy Policy.