AI Model Deployment Guide
Self-Managed Kindo does not ship with AI models — you provide the inference infrastructure. This means you are responsible for deploying, configuring, and maintaining the models that power chat, agents, workflows, embeddings, and every other AI-driven feature in Kindo.
This guide helps you get model deployment right the first time. Whether you are connecting cloud API providers (OpenAI, Anthropic, Google), self-hosting models with vLLM, or mixing both — the principles here will save you hours of debugging.
Before You Start
Before deploying any model, answer these questions:
- What workloads will this model serve? Chat, agents, embeddings, transcription, or a combination? (See Minimum Model Requirements for the baseline.)
- Cloud API, self-hosted, or both? Cloud APIs are simpler to configure. Self-hosted models give you full control but require GPU infrastructure and careful tuning.
- What is the model’s full context window? Not what you think it is — what the official documentation says. Getting this wrong is the #1 deployment mistake we see.
- Does the model support tool/function calling? Kindo agents rely on tool calling. If your inference server needs a flag to enable it, you must set it.
Cloud API Models
Connecting cloud-hosted models (OpenAI, Anthropic, Google, Azure OpenAI, Groq) is the simplest path. You need an API key and the correct endpoint — no GPU infrastructure required.
Setup Checklist
- API key obtained from the provider (OpenAI, Anthropic, Google AI Studio, Azure, etc.)
- Endpoint URL confirmed — use the provider’s documented base URL, not a guess
- Model name verified — use the exact model identifier the provider expects (e.g., `gpt-4o`, `claude-sonnet-4-20250514`, `gemini-2.5-pro`)
- Context window set correctly — match the provider’s documented limit for the specific model version
- Rate limits understood — know your tier’s tokens-per-minute and requests-per-minute limits
- Network access confirmed — your cluster can reach the provider’s API endpoint on port 443
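A quick way to sanity-check the API key and the network path before touching Kindo is to call the provider's model-list endpoint from a machine on the same network as your cluster (shown here for OpenAI; the host and auth header differ per provider):

```bash
# Confirms the key is valid and the provider endpoint is reachable on port 443.
curl -s https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | head -c 300
```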
Adding a Cloud Model to Kindo
Use the model management API to register the model. The key fields:
| Field | What to set | Common mistake |
|---|---|---|
| `contextWindow` | The model’s full advertised context window | Setting it too low (e.g., 8,096 when the model supports 128,000) |
| `litellmModelName` | A unique name for routing | Using a name that conflicts with an existing model |
| `litellmParams.model` | Provider-prefixed model ID (see prefix table below) | Wrong prefix or outdated model version |
| `litellmParams.api_key` | Your provider API key | Key for wrong environment or expired key |
| `litellmParams.api_base` | Provider endpoint (only if non-default) | Setting this when it should be omitted for standard providers |
Provider prefix reference for `litellmParams.model`:
| Provider | Prefix | Example |
|---|---|---|
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/llama-4-scout |
For the full API call, see Add a Global Model.
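Putting the fields together, a registration payload for a cloud model looks roughly like this (a sketch only; the field names mirror the table above, and the exact request envelope and endpoint are documented in Add a Global Model):

```json
{
  "litellmModelName": "claude-sonnet-4",
  "contextWindow": 200000,
  "litellmParams": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "api_key": "<your-anthropic-api-key>"
  }
}
```

Note that `api_base` is omitted: per the field table, it is only set for non-default endpoints.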
Provider-Specific Notes
OpenAI / Azure OpenAI:
- Azure requires `api_base`, `api_version`, and `api_key` specific to your Azure deployment
- Model names differ between OpenAI and Azure (e.g., `gpt-4o` vs your Azure deployment name)
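For Azure specifically, the `litellmParams` sketch below shows the shape; the deployment name, resource endpoint, and API version are placeholders you must replace with your own values:

```json
{
  "litellmModelName": "gpt-4o-azure",
  "contextWindow": 128000,
  "litellmParams": {
    "model": "azure/<your-deployment-name>",
    "api_base": "https://<your-resource>.openai.azure.com",
    "api_version": "<your-api-version>",
    "api_key": "<your-azure-api-key>"
  }
}
```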
Anthropic:
- Use `anthropic/` prefix in `litellmParams.model` (e.g., `anthropic/claude-sonnet-4-20250514`)
- Supports up to 200k context window — set it accordingly
Google (Gemini):
- Use `gemini/` prefix (e.g., `gemini/gemini-2.5-pro`)
- Gemini 2.5 Pro supports 1M tokens — do not cap it artificially
AWS Bedrock:
- Use `bedrock/` prefix in `litellmParams.model` (e.g., `bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0`)
- Models must be explicitly enabled in the AWS Console under Bedrock → Model access before they can be used
- Use cross-region inference profiles for better availability (e.g., `us.anthropic.claude-*` instead of region-specific ARNs)
- IAM permissions required: `aws-marketplace:ViewSubscriptions` and `aws-marketplace:Subscribe` in addition to standard `bedrock:InvokeModel` permissions
- Ensure your AWS credentials (access key, secret key, region) are configured in `litellmParams` (see the sketch below)
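A Bedrock registration sketch, assuming LiteLLM's conventional AWS credential field names (verify the exact names against your Kindo version before use):

```json
{
  "litellmModelName": "claude-sonnet-4-bedrock",
  "contextWindow": 200000,
  "litellmParams": {
    "model": "bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0",
    "aws_access_key_id": "<access-key-id>",
    "aws_secret_access_key": "<secret-access-key>",
    "aws_region_name": "us-east-1"
  }
}
```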
Self-Hosted Models
Self-hosted models give you complete control over data residency and model selection, but they require careful configuration. Most deployment issues we see come from incorrect inference server configuration, not from Kindo itself.
Pre-Deployment Checklist
Before you start any self-hosted model deployment:
- Hardware verified — GPUs have enough VRAM for the model at your target context length (see GPU sizing below)
- NVIDIA drivers and container runtime installed — `nvidia-smi` shows your GPUs, `nvidia-container-runtime` is configured
- Model weights downloaded — access tokens set, weights pulled successfully to local storage
- Official model documentation read — you know the recommended context window, quantization, and any special flags
- Context window sized correctly — set to the model’s full supported context length (or the maximum your hardware can serve)
- Tool-call parser identified — if the model supports tool calling, you know which `--tool-call-parser` flag to use
- DNS/hostname planned — the inference endpoint will be reachable from your Kubernetes cluster
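Two of these items are quick to verify from the GPU host before going further (the model ID and target directory below are illustrative; `huggingface-cli` ships with the `huggingface_hub` package):

```bash
# List GPUs and total VRAM so you can compare against the sizing table below.
nvidia-smi --query-gpu=name,memory.total --format=csv

# Pre-pull model weights to local storage (gated models require an HF token).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir /models/llama-3.1-70b
```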
GPU Sizing
The VRAM required depends on the model size, quantization, and the context length you want to serve. Context length is the primary variable — a 70B model serving 4k context needs far less VRAM than the same model serving 128k context due to KV-cache memory.
| Model Size | Quantization | Short Context (4k) | Medium Context (32k) | Full Context (128k+) |
|---|---|---|---|---|
| 7–8B | FP16/BF16 | 1× 24GB GPU | 1× 24GB GPU | 1× 48GB GPU |
| 7–8B | FP8 | 1× 24GB GPU | 1× 24GB GPU | 1× 24GB GPU |
| 13B | FP16/BF16 | 1× 48GB GPU | 1× 80GB GPU | 1× 80GB GPU |
| 13B | FP8 | 1× 24GB GPU | 1× 48GB GPU | 1× 80GB GPU |
| 30–34B | FP16/BF16 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 30–34B | FP8 | 1× 48GB GPU | 1× 80GB GPU | 1–2× 80GB GPU |
| 70B | FP16/BF16 | 2× 80GB GPU | 4× 80GB GPU | 4–8× 80GB GPU |
| 70B | FP8 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 70B | AWQ/GPTQ | 1× 80GB GPU | 2× 80GB GPU | 4× 80GB GPU |
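You can sanity-check the table with a simple estimate: weights plus KV cache. A back-of-envelope sketch for Llama 3.1 70B (80 layers, 8 KV heads, head dim 128, per its published config; approximate, and ignoring activation and framework overhead):

```bash
python3 - <<'EOF'
# KV cache bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × context_len × bytes/value
layers, kv_heads, head_dim, ctx, bytes_per = 80, 8, 128, 128_000, 2  # BF16 = 2 bytes
kv_gb = 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1e9
weights_gb = 70e9 * 2 / 1e9  # 70B params at BF16
print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.0f} GB per full-length sequence")
EOF
```

That works out to roughly 140 GB of weights plus ~42 GB of KV cache per full-length sequence, which is why full-context 70B serving lands in the 4–8 GPU row.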
vLLM Configuration
vLLM is the most common inference server for self-hosted models with Kindo. Getting the configuration right is critical.
Essential Flags
Consider these flags for every vLLM deployment:
```bash
vllm serve <model-id> \
  --served-model-name <name>          # Name Kindo will use to route requests
  --port <port>                       # Port to serve on
  --max-model-len <context-length>    # CRITICAL: Set to the model's full context window
  --dtype bfloat16                    # Use BF16 for modern GPUs (H100, A100, B200)
  --tensor-parallel-size <num-gpus>   # Number of GPUs for tensor parallelism
  --tool-call-parser <parser>         # CRITICAL for agents: enables tool/function calling
  --enable-auto-tool-choice           # Let vLLM automatically route tool calls
  --enable-prefix-caching             # Improves performance for repeated prompts
  --enable-chunked-prefill            # Better memory utilization for long contexts
```
The Two Flags That Cause the Most Issues
1. `--max-model-len` (Context Window)
This is the single most misconfigured setting. If you set this too low, Kindo will truncate conversations, agents will lose context mid-task, and workflows will fail on long documents.
| Model | Official Max Context | What we’ve seen customers set | Impact |
|---|---|---|---|
| Llama 4 Scout | 10,000,000 | 8,096 | Model can barely hold a single conversation turn |
| Llama 3.1 70B | 128,000 | 4,096 | Agents lose tool-call history after a few steps |
| DeepSeek-V3 | 128,000 | 8,192 | Workflow steps fail on any non-trivial input |
| Mistral Large | 128,000 | 32,000 | Works for chat, breaks on long agent tasks |
Rule of thumb: Set --max-model-len to the model’s full advertised context window, then reduce only if your hardware genuinely cannot support it. When reducing, start from the top and work down — don’t guess a low number.
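You can confirm what a running vLLM server actually loaded by querying its model list; recent vLLM builds report the effective limit in the model entry (the field name may vary by version):

```bash
# Look for "max_model_len" in the output; it should match what you passed at startup.
curl -s http://<your-inference-host>:<port>/v1/models | python3 -m json.tool
```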
2. `--tool-call-parser` (Tool/Function Calling)
Without this flag, models served through vLLM will not properly handle tool calls. Kindo agents depend on tool calling for every interaction with integrations, MCP servers, and multi-step workflows. If you skip this flag, agents will not work.
| Model Family | Parser Flag | Notes |
|---|---|---|
| Llama 4 (Scout, Maverick) | --tool-call-parser llama4 | New parser added in vLLM 0.8+ |
| Llama 3.1 / 3.3 | --tool-call-parser llama3_json | |
| Qwen 2.5 / 3 | --tool-call-parser hermes | Uses Hermes-style tool format |
| Qwen 3 Coder | --tool-call-parser qwen3_coder | Specific parser for coder variants |
| Mistral / Mixtral | --tool-call-parser mistral | |
| DeepSeek-V3 / R1 | --tool-call-parser hermes | Check vLLM docs for latest |
| Jamba | --tool-call-parser jamba |
Complete Example: Llama 4 Scout on 1× H100
```bash
# --max-model-len 1048576 = 1M tokens (reduced from the 10M full context to fit single-H100 VRAM)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --served-model-name llama-4-scout \
  --port 8000 \
  --max-model-len 1048576 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser llama4 \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Complete Example: DeepSeek-V3 on 8× H100
```bash
vllm serve deepseek-ai/DeepSeek-V3 \
  --served-model-name deepseek-v3 \
  --port 8000 \
  --max-model-len 128000 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Complete Example: Mistral Large on 2× H100
```bash
vllm serve mistralai/Mistral-Large-Instruct-2411 \
  --served-model-name mistral-large \
  --port 8000 \
  --max-model-len 128000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Model Family Quick Reference
Llama 4 (Scout, Maverick)
- Architecture: Mixture of Experts (MoE)
- Context: Up to 10M tokens (Scout), 1M tokens (Maverick)
- Tool parser: `llama4`
- Key notes: These are MoE models — the active parameter count is much lower than the total. Scout 17B-16E activates ~17B parameters per token despite having 109B total. Hardware requirements are based on total parameters (all experts must be in memory), but inference speed is based on active parameters.
- Gotcha: Do not set `--max-model-len 8096` on a model designed for 10M tokens. Start with at least 1M and reduce only if hardware requires it.
Llama 3.1 / 3.3
- Context: 128k tokens
- Tool parser: `llama3_json`
- Key notes: The 70B Instruct variant is an excellent general-purpose model. Requires `--tool-call-parser llama3_json` — not `llama3` (which doesn’t exist as a parser).
Mistral / Mixtral
- Context: 32k–128k tokens depending on variant
- Tool parser: `mistral`
- Key notes: Mistral Large supports 128k context. Mixtral 8x7B uses MoE architecture. Both work well with vLLM but require the `mistral` parser for tool calling.
DeepSeek-V3 / DeepSeek-R1
- Context: 128k tokens
- Tool parser: `hermes` (check latest vLLM docs)
- Key notes: DeepSeek-V3 is a large MoE model (671B total, ~37B active). Requires significant GPU memory (8× H100 recommended). R1 is the reasoning variant — if using for agentic workloads, ensure tool calling works with your vLLM version.
Qwen 2.5 / Qwen 3
- Context: 32k–128k tokens depending on variant
- Tool parser: `hermes` (Qwen 2.5), `qwen3_coder` (Qwen 3 Coder)
- Key notes: Excellent multilingual support. Qwen 2.5 72B is a strong alternative to Llama 3.1 70B (a launch sketch follows below).
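Since there is no complete example for Qwen above, here is a sketch in the same pattern. The GPU count and 32k context follow the sizing table for a 70B-class BF16 model and are assumptions to adjust for your hardware and the official Qwen model card:

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --served-model-name qwen-2.5-72b \
  --port 8000 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```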
Verifying Your Model Before Connecting to Kindo
Always test your model deployment independently before adding it to Kindo. This isolates inference issues from Kindo configuration issues.
Step 1: Health Check
Verify the inference server is running and the model is loaded:
```bash
curl http://<your-inference-host>:<port>/v1/models
```
Expected response: a JSON object listing your served model name. If this fails, the model is not ready — do not proceed.
Step 2: Basic Completion Test
```bash
curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [
      {"role": "user", "content": "What is 2 + 2? Reply with just the number."}
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```
You should get a coherent response with `"finish_reason": "stop"`. If the response is garbled, the model may be misconfigured (wrong dtype, corrupted weights, or insufficient VRAM).
Step 3: Tool Calling Test
This is the test most people skip — and the one that catches the most issues:
```bash
curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 256
  }'
```
Expected: The response should contain a `tool_calls` array with a call to `get_weather` and `"location": "San Francisco"` (or similar). If you get a plain text response instead of a tool call, your `--tool-call-parser` flag is missing or wrong.
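For reference, a passing response looks roughly like this (IDs and argument formatting vary by model and vLLM version):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```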
Step 4: Context Length Test
Verify the model can actually handle the context length you configured:
```bash
# Generate a prompt that approaches your configured context length.
# For a 128k context model, try sending ~100k tokens of input.
python3 -c "
import json
# ~4 chars per token, target ~100k tokens
padding = 'The quick brown fox jumps over the lazy dog. ' * 11000
msg = {'model': '<your-served-model-name>',
       'messages': [{'role': 'user', 'content': f'{padding}\n\nSummarize the above text in one sentence.'}],
       'max_tokens': 64}
print(json.dumps(msg))
" | curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @-
```
If this returns a context-length error, your `--max-model-len` is too low or your GPU lacks sufficient memory. Adjust accordingly.
Connecting to Kindo
Once your model passes all verification steps:
- Add the model to Kindo using the model management API
- Set the context window to match your verified deployment (the value you set in `--max-model-len`)
- Configure Unleash feature variants to use the new model ID where appropriate (see Unleash Features)
- Enable multimodal support (if applicable) — if the model supports image or file input (e.g., vision models), add its model ID to the `MULTIMODAL_MODELS` Unleash flag so Kindo surfaces file/image upload capabilities in the UI
- Test end-to-end in Kindo — send a chat message, run an agent task, verify tool calling works
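For a self-hosted vLLM model, the registration payload uses the `openai/` prefix and points `api_base` at your inference endpoint. A sketch for the Llama 4 Scout deployment above (whether an `api_key` is needed depends on your server's auth setup):

```json
{
  "litellmModelName": "llama-4-scout",
  "contextWindow": 1048576,
  "litellmParams": {
    "model": "openai/llama-4-scout",
    "api_base": "http://<your-inference-host>:8000/v1",
    "api_key": "<token-if-your-server-requires-one>"
  }
}
```

Note that `litellmModelName` matches the `--served-model-name` from the vLLM command, and `contextWindow` matches `--max-model-len`.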
Networking and DNS
Model inference endpoints must be reachable from the Kindo Kubernetes cluster. This sounds obvious, but DNS resolution issues are the third most common deployment problem we see.
Checklist
- Inference endpoint has a stable hostname or IP — not a localhost address
- DNS resolves from inside the cluster — test with `kubectl run -it --rm dns-test --image=busybox -- nslookup <your-inference-host>`
- Port is accessible — test with `kubectl run -it --rm net-test --image=busybox -- wget -qO- http://<your-inference-host>:<port>/v1/models`
- No firewall blocking traffic between the Kubernetes cluster and the inference endpoint
- TLS configured if required — if using HTTPS, ensure the certificate is trusted by the cluster (or use `api_base` with the correct scheme)
Common DNS Issues
| Symptom | Likely cause | Fix |
|---|---|---|
| `Connection refused` from Kindo | Inference server not running or wrong port | Verify the server is up: `curl <host>:<port>/v1/models` from outside the cluster |
| `Name resolution failed` | DNS not configured for the inference hostname | Add DNS entries or use IP addresses; verify with `nslookup` from inside a pod |
| Works from your laptop, fails from Kindo | Split DNS or firewall rules | Ensure the Kubernetes nodes can reach the inference endpoint, not just your workstation |
| Intermittent timeouts | Inference server overloaded or network instability | Check GPU utilization, consider scaling replicas or reducing concurrent requests |
Common Pitfalls
A summary of the issues that catch most self-hosted deployments:
| Pitfall | Impact | Prevention |
|---|---|---|
| Context window too small | Conversations truncated, agents lose context, workflows fail | Set --max-model-len to the model’s full advertised context window |
| Missing tool-call parser | Agents cannot use tools, integrations broken | Always set --tool-call-parser and --enable-auto-tool-choice |
| Wrong served-model-name | Kindo can’t route requests to the model | Ensure --served-model-name matches litellmModelName in Kindo |
| DNS not reachable from cluster | Kindo returns connection errors | Test DNS resolution and connectivity from inside a Kubernetes pod |
| Skipping model verification | Problems blamed on Kindo that are actually inference issues | Always run the verification steps before connecting |
| Outdated vLLM version | Missing parser support, known bugs | Use vLLM 0.8+ for Llama 4 support, 0.6+ for most other models |
| Not reading official model docs | Wrong dtype, missing flags, unsupported features | Always start with the model provider’s official documentation |