AI Model Deployment Guide

Self-Managed Kindo does not ship with AI models — you provide the inference infrastructure. This means you are responsible for deploying, configuring, and maintaining the models that power chat, agents, workflows, embeddings, and every other AI-driven feature in Kindo.

This guide helps you get model deployment right the first time. Whether you are connecting cloud API providers (OpenAI, Anthropic, Google), self-hosting models with vLLM, or mixing both — the principles here will save you hours of debugging.

Before You Start

Before deploying any model, answer these questions:

  1. What workloads will this model serve? Chat, agents, embeddings, transcription, or a combination? (See Minimum Model Requirements for the baseline.)
  2. Cloud API, self-hosted, or both? Cloud APIs are simpler to configure. Self-hosted models give you full control but require GPU infrastructure and careful tuning.
  3. What is the model’s full context window? Not what you think it is — what the official documentation says. Getting this wrong is the #1 deployment mistake we see.
  4. Does the model support tool/function calling? Kindo agents rely on tool calling. If your inference server needs a flag to enable it, you must set it.

Cloud API Models

Connecting cloud-hosted models (OpenAI, Anthropic, Google, Azure OpenAI, Groq) is the simplest path. You need an API key and the correct endpoint — no GPU infrastructure required.

Setup Checklist

  • API key obtained from the provider (OpenAI, Anthropic, Google AI Studio, Azure, etc.)
  • Endpoint URL confirmed — use the provider’s documented base URL, not a guess
  • Model name verified — use the exact model identifier the provider expects (e.g., gpt-4o, claude-sonnet-4-20250514, gemini-2.5-pro)
  • Context window set correctly — match the provider’s documented limit for the specific model version
  • Rate limits understood — know your tier’s tokens-per-minute and requests-per-minute limits
  • Network access confirmed — your cluster can reach the provider’s API endpoint on port 443
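Rate limits in particular are worth handling up front: cloud providers return HTTP 429 when you exceed your tier, and the standard remedy is exponential backoff with jitter. A minimal sketch (the `RateLimitError` class and `send_request` callable are placeholders for whatever client you use, not a real provider SDK):

```python
import time
import random

class RateLimitError(Exception):
    """Placeholder for whatever your client raises on HTTP 429."""

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry send_request on rate-limit errors with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # delays of 1x, 2x, 4x ... base_delay, plus up to one base_delay of jitter
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```

Knowing your tier's tokens-per-minute budget tells you whether retries are enough or whether you need to queue requests upstream.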

Adding a Cloud Model to Kindo

Use the model management API to register the model. The key fields:

| Field | What to set | Common mistake |
| --- | --- | --- |
| contextWindow | The model’s full advertised context window | Setting it too low (e.g., 8,096 when the model supports 128,000) |
| litellmModelName | A unique name for routing | Using a name that conflicts with an existing model |
| litellmParams.model | Provider-prefixed model ID (see prefix table below) | Wrong prefix or outdated model version |
| litellmParams.api_key | Your provider API key | Key for wrong environment or expired key |
| litellmParams.api_base | Provider endpoint (only if non-default) | Setting this when it should be omitted for standard providers |
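Assembled together, the fields form a payload like the following sketch. The field names come from the table above; the model ID, context window, and key are illustrative Anthropic values, and the actual endpoint and authentication are documented under Add a Global Model:

```python
import json

# Hypothetical registration payload for a cloud model.
payload = {
    "contextWindow": 200000,                # full advertised window for this model
    "litellmModelName": "claude-sonnet-4",  # unique routing name inside Kindo
    "litellmParams": {
        "model": "anthropic/claude-sonnet-4-20250514",  # provider-prefixed ID
        "api_key": "sk-ant-...",            # placeholder; substitute your real key
        # "api_base" deliberately omitted: only set it for non-default endpoints
    },
}
print(json.dumps(payload, indent=2))
```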

Provider prefix reference for litellmParams.model:

| Provider | Prefix | Example |
| --- | --- | --- |
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/llama-4-scout |

For the full API call, see Add a Global Model.

Provider-Specific Notes

OpenAI / Azure OpenAI:

  • Azure requires api_base, api_version, and api_key specific to your Azure deployment
  • Model names differ between OpenAI and Azure (e.g., gpt-4o vs your Azure deployment name)

Anthropic:

  • Use anthropic/ prefix in litellmParams.model (e.g., anthropic/claude-sonnet-4-20250514)
  • Supports up to 200k context window — set it accordingly

Google (Gemini):

  • Use gemini/ prefix (e.g., gemini/gemini-2.5-pro)
  • Gemini 2.5 Pro supports 1M tokens — do not cap it artificially

AWS Bedrock:

  • Use bedrock/ prefix in litellmParams.model (e.g., bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0)
  • Models must be explicitly enabled in the AWS Console under Bedrock → Model access before they can be used
  • Use cross-region inference profiles for better availability (e.g., us.anthropic.claude-* instead of region-specific ARNs)
  • IAM permissions required: aws-marketplace:ViewSubscriptions and aws-marketplace:Subscribe in addition to standard bedrock:InvokeModel permissions
  • Ensure your AWS credentials (access key, secret key, region) are configured in litellmParams
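A Bedrock litellmParams block then looks roughly like this sketch. The aws_* key names follow LiteLLM's Bedrock provider convention — confirm them against the LiteLLM version bundled with your deployment — and the credentials shown are placeholders:

```python
# Sketch of litellmParams for a Bedrock model using a cross-region
# inference profile (us.anthropic...) rather than a region-specific ARN.
bedrock_params = {
    "model": "bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0",
    "aws_access_key_id": "AKIA...",      # placeholder credentials
    "aws_secret_access_key": "...",      # never commit real keys
    "aws_region_name": "us-east-1",      # region where model access is enabled
}
```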

Self-Hosted Models

Self-hosted models give you complete control over data residency and model selection, but they require careful configuration. Most deployment issues we see come from incorrect inference server configuration, not from Kindo itself.

Pre-Deployment Checklist

Before you start any self-hosted model deployment:

  • Hardware verified — GPUs have enough VRAM for the model at your target context length (see GPU sizing below)
  • NVIDIA drivers and container runtime installed — nvidia-smi shows your GPUs, nvidia-container-runtime is configured
  • Model weights downloaded — access tokens set, weights pulled successfully to local storage
  • Official model documentation read — you know the recommended context window, quantization, and any special flags
  • Context window sized correctly — set to the model’s full supported context length (or the maximum your hardware can serve)
  • Tool-call parser identified — if the model supports tool calling, you know which --tool-call-parser flag to use
  • DNS/hostname planned — the inference endpoint will be reachable from your Kubernetes cluster

GPU Sizing

The VRAM required depends on the model size, quantization, and the context length you want to serve. Context length is the primary variable — a 70B model serving 4k context needs far less VRAM than the same model serving 128k context due to KV-cache memory.

| Model Size | Quantization | Short Context (4k) | Medium Context (32k) | Full Context (128k+) |
| --- | --- | --- | --- | --- |
| 7–8B | FP16/BF16 | 1× 24GB GPU | 1× 24GB GPU | 1× 48GB GPU |
| 7–8B | FP8 | 1× 24GB GPU | 1× 24GB GPU | 1× 24GB GPU |
| 13B | FP16/BF16 | 1× 48GB GPU | 1× 80GB GPU | 1× 80GB GPU |
| 13B | FP8 | 1× 24GB GPU | 1× 48GB GPU | 1× 80GB GPU |
| 30–34B | FP16/BF16 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 30–34B | FP8 | 1× 48GB GPU | 1× 80GB GPU | 1–2× 80GB GPU |
| 70B | FP16/BF16 | 2× 80GB GPU | 4× 80GB GPU | 4–8× 80GB GPU |
| 70B | FP8 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 70B | AWQ/GPTQ | 1× 80GB GPU | 2× 80GB GPU | 4× 80GB GPU |
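The KV-cache growth behind the table above can be estimated with a back-of-envelope formula: per token, the cache stores two tensors (K and V) per layer, each of size kv_heads × head_dim × bytes-per-element. The architecture numbers below (80 layers, 8 KV heads with GQA, head dim 128) are Llama-3.1-70B-style values used purely for illustration:

```python
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size in GiB for ONE sequence at full context.

    2x accounts for the separate K and V tensors; bytes_per_elem=2 is BF16/FP16.
    Weights, activations, and batching overhead are on top of this.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len
    return total_bytes / 1024**3

# Llama-3.1-70B-style config at a 128k context window
print(round(kv_cache_gib(131_072, n_layers=80, n_kv_heads=8, head_dim=128), 1))  # → 40.0
```

At 4k context the same model's cache is only ~1.25 GiB per sequence, which is why the context column, not the parameter count, dominates the GPU-count jumps in the table.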

vLLM Configuration

vLLM is the most common inference server for self-hosted models with Kindo. Getting the configuration right is critical.

Essential Flags

Every vLLM deployment should consider these flags:

# Flag reference:
#   --served-model-name        Name Kindo will use to route requests
#   --port                     Port to serve on
#   --max-model-len            CRITICAL: set to the model's full context window
#   --dtype bfloat16           Use BF16 on modern GPUs (H100, A100, B200)
#   --tensor-parallel-size     Number of GPUs for tensor parallelism
#   --tool-call-parser         CRITICAL for agents: enables tool/function calling
#   --enable-auto-tool-choice  Let vLLM automatically emit tool calls
#   --enable-prefix-caching    Improves performance for repeated prompts
#   --enable-chunked-prefill   Better memory utilization for long contexts
vllm serve <model-id> \
--served-model-name <name> \
--port <port> \
--max-model-len <context-length> \
--dtype bfloat16 \
--tensor-parallel-size <num-gpus> \
--tool-call-parser <parser> \
--enable-auto-tool-choice \
--enable-prefix-caching \
--enable-chunked-prefill

The Two Flags That Cause the Most Issues

1. --max-model-len (Context Window)

This is the single most misconfigured setting. If you set this too low, Kindo will truncate conversations, agents will lose context mid-task, and workflows will fail on long documents.

| Model | Official Max Context | What we’ve seen customers set | Impact |
| --- | --- | --- | --- |
| Llama 4 Scout | 10,000,000 | 8,096 | Model can barely hold a single conversation turn |
| Llama 3.1 70B | 128,000 | 4,096 | Agents lose tool-call history after a few steps |
| DeepSeek-V3 | 128,000 | 8,192 | Workflow steps fail on any non-trivial input |
| Mistral Large | 128,000 | 32,000 | Works for chat, breaks on long agent tasks |

Rule of thumb: Set --max-model-len to the model’s full advertised context window, then reduce only if your hardware genuinely cannot support it. When reducing, start from the top and work down — don’t guess a low number.
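You can confirm what the running server actually picked up: recent vLLM versions report a max_model_len field for each model in the /v1/models response. The sketch below flags any served model whose context is lower than intended (treat the field name as an assumption and verify against your vLLM version):

```python
import json
import urllib.request

def find_short_contexts(models_payload, expected):
    """Return IDs of served models whose max_model_len is below `expected`.

    Models that do not report max_model_len are not flagged.
    """
    return [m["id"] for m in models_payload.get("data", [])
            if m.get("max_model_len", expected) < expected]

def check_served_context(base_url, expected):
    """Fetch /v1/models from a vLLM server and fail loudly on short contexts."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        short = find_short_contexts(json.load(resp), expected)
    if short:
        raise RuntimeError(f"max_model_len below {expected} for: {short}")
```

For example, `check_served_context("http://inference-host:8000", 128_000)` would raise if the server silently fell back to a smaller window than you configured.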

2. --tool-call-parser (Tool/Function Calling)

Without this flag, models served through vLLM will not properly handle tool calls. Kindo agents depend on tool calling for every interaction with integrations, MCP servers, and multi-step workflows. If you skip this flag, agents will not work.

| Model Family | Parser Flag | Notes |
| --- | --- | --- |
| Llama 4 (Scout, Maverick) | --tool-call-parser llama4 | New parser added in vLLM 0.8+ |
| Llama 3.1 / 3.3 | --tool-call-parser llama3_json | |
| Qwen 2.5 / 3 | --tool-call-parser hermes | Uses Hermes-style tool format |
| Qwen 3 Coder | --tool-call-parser qwen3_coder | Specific parser for coder variants |
| Mistral / Mixtral | --tool-call-parser mistral | |
| DeepSeek-V3 / R1 | --tool-call-parser hermes | Check vLLM docs for latest |
| Jamba | --tool-call-parser jamba | |

Complete Example: Llama 4 Scout on 1× H100

# --max-model-len: 1M tokens, reduced from the 10M full context to fit a single H100's VRAM
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--served-model-name llama-4-scout \
--port 8000 \
--max-model-len 1048576 \
--dtype bfloat16 \
--tensor-parallel-size 1 \
--tool-call-parser llama4 \
--enable-auto-tool-choice \
--enable-prefix-caching \
--enable-chunked-prefill

Complete Example: DeepSeek-V3 on 8× H100

vllm serve deepseek-ai/DeepSeek-V3 \
--served-model-name deepseek-v3 \
--port 8000 \
--max-model-len 128000 \
--dtype bfloat16 \
--tensor-parallel-size 8 \
--tool-call-parser hermes \
--enable-auto-tool-choice \
--enable-prefix-caching \
--enable-chunked-prefill

Complete Example: Mistral Large on 2× H100

vllm serve mistralai/Mistral-Large-Instruct-2411 \
--served-model-name mistral-large \
--port 8000 \
--max-model-len 128000 \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--enable-prefix-caching \
--enable-chunked-prefill

Model Family Quick Reference

Llama 4 (Scout, Maverick)

  • Architecture: Mixture of Experts (MoE)
  • Context: Up to 10M tokens (Scout), 1M tokens (Maverick)
  • Tool parser: llama4
  • Key notes: These are MoE models — the active parameter count is much lower than the total. Scout 17B-16E activates ~17B parameters per token despite having 109B total. Hardware requirements are based on total parameters (all experts must be in memory), but inference speed is based on active parameters.
  • Gotcha: Do not set --max-model-len 8096 on a model designed for 10M tokens. Start with at least 1M and reduce only if hardware requires it.

Llama 3.1 / 3.3

  • Context: 128k tokens
  • Tool parser: llama3_json
  • Key notes: The 70B Instruct variant is an excellent general-purpose model. Requires --tool-call-parser llama3_json — not llama3 (which doesn’t exist as a parser).

Mistral / Mixtral

  • Context: 32k–128k tokens depending on variant
  • Tool parser: mistral
  • Key notes: Mistral Large supports 128k context. Mixtral 8x7B uses MoE architecture. Both work well with vLLM but require the mistral parser for tool calling.

DeepSeek-V3 / DeepSeek-R1

  • Context: 128k tokens
  • Tool parser: hermes (check latest vLLM docs)
  • Key notes: DeepSeek-V3 is a large MoE model (671B total, ~37B active). Requires significant GPU memory (8× H100 recommended). R1 is the reasoning variant — if using for agentic workloads, ensure tool calling works with your vLLM version.

Qwen 2.5 / Qwen 3

  • Context: 32k–128k tokens depending on variant
  • Tool parser: hermes (Qwen 2.5), qwen3_coder (Qwen 3 Coder)
  • Key notes: Excellent multilingual support. Qwen 2.5 72B is a strong alternative to Llama 3.1 70B.

Verifying Your Model Before Connecting to Kindo

Always test your model deployment independently before adding it to Kindo. This isolates inference issues from Kindo configuration issues.

Step 1: Health Check

Verify the inference server is running and the model is loaded:

curl http://<your-inference-host>:<port>/v1/models

Expected response: a JSON object listing your served model name. If this fails, the model is not ready — do not proceed.

Step 2: Basic Completion Test

curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your-served-model-name>",
"messages": [
{"role": "user", "content": "What is 2 + 2? Reply with just the number."}
],
"max_tokens": 32,
"temperature": 0
}'

You should get a coherent response with "finish_reason": "stop". If the response is garbled, the model may be misconfigured (wrong dtype, corrupted weights, or insufficient VRAM).

Step 3: Tool Calling Test

This is the test most people skip — and the one that catches the most issues:

curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your-served-model-name>",
"messages": [
{"role": "user", "content": "What is the weather in San Francisco?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto",
"max_tokens": 256
}'

Expected: The response should contain a tool_calls array with a call to get_weather and "location": "San Francisco" (or similar). If you get a plain text response instead of a tool call, your --tool-call-parser flag is missing or wrong.
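When scripting this check, you can validate the response shape programmatically. The sketch below assumes the standard OpenAI-compatible response layout (choices → message → tool_calls, with function arguments delivered as a JSON string):

```python
import json

def assert_tool_call(response, expected_fn):
    """Verify a /v1/chat/completions response contains a well-formed tool call.

    Returns the parsed arguments dict, or raises if the model answered in
    plain text — the classic symptom of a missing --tool-call-parser flag.
    """
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    if not calls:
        raise AssertionError("plain-text reply — check --tool-call-parser")
    call = calls[0]["function"]
    assert call["name"] == expected_fn, f"unexpected tool: {call['name']}"
    return json.loads(call["arguments"])  # arguments arrive as a JSON string

# Abbreviated shape of a passing response:
sample = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather",
                  "arguments": "{\"location\": \"San Francisco\"}"}}]}}]}
print(assert_tool_call(sample, "get_weather"))  # → {'location': 'San Francisco'}
```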

Step 4: Context Length Test

Verify the model can actually handle the context length you configured:

# Generate a prompt that approaches your configured context length
# For a 128k context model, try sending ~100k tokens of input
python3 -c "
import json, sys
# ~4 chars per token; 9,000 repetitions ≈ 405k chars ≈ 100k tokens
padding = 'The quick brown fox jumps over the lazy dog. ' * 9000
msg = {'model': '<your-served-model-name>', 'messages': [{'role': 'user', 'content': f'{padding}\n\nSummarize the above text in one sentence.'}], 'max_tokens': 64}
print(json.dumps(msg))
" | curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d @-

If this returns a context-length error, your --max-model-len is too low or your GPU lacks sufficient memory. Adjust accordingly.

Connecting to Kindo

Once your model passes all verification steps:

  1. Add the model to Kindo using the model management API
  2. Set the context window to match your verified deployment (the value you set in --max-model-len)
  3. Configure Unleash feature variants to use the new model ID where appropriate (see Unleash Features)
  4. Enable multimodal support (if applicable) — if the model supports image or file input (e.g., vision models), add its model ID to the MULTIMODAL_MODELS Unleash flag so Kindo surfaces file/image upload capabilities in the UI
  5. Test end-to-end in Kindo — send a chat message, run an agent task, verify tool calling works

Networking and DNS

Model inference endpoints must be reachable from the Kindo Kubernetes cluster. This sounds obvious, but DNS resolution issues are the third most common deployment problem we see.

Checklist

  • Inference endpoint has a stable hostname or IP — not a localhost address
  • DNS resolves from inside the cluster — test with kubectl run -it --rm dns-test --image=busybox -- nslookup <your-inference-host>
  • Port is accessible — test with kubectl run -it --rm net-test --image=busybox -- wget -qO- http://<your-inference-host>:<port>/v1/models
  • No firewall blocking traffic between the Kubernetes cluster and the inference endpoint
  • TLS configured if required — if using HTTPS, ensure the certificate is trusted by the cluster (or use api_base with the correct scheme)

Common DNS Issues

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Connection refused from Kindo | Inference server not running or wrong port | Verify the server is up: curl <host>:<port>/v1/models from outside the cluster |
| Name resolution failed | DNS not configured for the inference hostname | Add DNS entries or use IP addresses; verify with nslookup from inside a pod |
| Works from your laptop, fails from Kindo | Split DNS or firewall rules | Ensure the Kubernetes nodes can reach the inference endpoint, not just your workstation |
| Intermittent timeouts | Inference server overloaded or network instability | Check GPU utilization, consider scaling replicas or reducing concurrent requests |

Common Pitfalls

A summary of the issues that catch most self-hosted deployments:

| Pitfall | Impact | Prevention |
| --- | --- | --- |
| Context window too small | Conversations truncated, agents lose context, workflows fail | Set --max-model-len to the model’s full advertised context window |
| Missing tool-call parser | Agents cannot use tools, integrations broken | Always set --tool-call-parser and --enable-auto-tool-choice |
| Wrong served-model-name | Kindo can’t route requests to the model | Ensure --served-model-name matches litellmModelName in Kindo |
| DNS not reachable from cluster | Kindo returns connection errors | Test DNS resolution and connectivity from inside a Kubernetes pod |
| Skipping model verification | Problems blamed on Kindo that are actually inference issues | Always run the verification steps before connecting |
| Outdated vLLM version | Missing parser support, known bugs | Use vLLM 0.8+ for Llama 4 support, 0.6+ for most other models |
| Not reading official model docs | Wrong dtype, missing flags, unsupported features | Always start with the model provider’s official documentation |

Next Steps