AI Model Deployment Guide
Self-Managed Kindo does not ship with AI models — you provide the inference infrastructure. This means you are responsible for deploying, configuring, and maintaining the models that power chat, agents, workflows, embeddings, and every other AI-driven feature in Kindo.
This guide helps you get model deployment right the first time. Whether you are connecting cloud API providers (OpenAI, Anthropic, Google), self-hosting models with vLLM, or mixing both — the principles here will save you hours of debugging.
Before You Start
Before deploying any model, answer these questions:
- What workloads will this model serve? Chat, agents, embeddings, transcription, or a combination? (See Minimum Model Requirements for the baseline.)
- Cloud API, self-hosted, or both? Cloud APIs are simpler to configure. Self-hosted models give you full control but require GPU infrastructure and careful tuning.
- What is the model’s full context window? Not what you think it is — what the official documentation says. Getting this wrong is the #1 deployment mistake we see.
- Does the model support tool/function calling? Kindo agents rely on tool calling. If your inference server needs a flag to enable it, you must set it.
Cloud API Models
Connecting cloud-hosted models (OpenAI, Anthropic, Google, Azure OpenAI, Groq) is the simplest path. You need an API key and the correct endpoint — no GPU infrastructure required.
Setup Checklist
- API key obtained from the provider (OpenAI, Anthropic, Google AI Studio, Azure, etc.)
- Endpoint URL confirmed — use the provider’s documented base URL, not a guess
- Model name verified — use the exact model identifier the provider expects (e.g., `gpt-4o`, `claude-sonnet-4-20250514`, `gemini-2.5-pro`)
- Context window set correctly — match the provider’s documented limit for the specific model version
- Rate limits understood — know your tier’s tokens-per-minute and requests-per-minute limits
- Network access confirmed — your cluster can reach the provider’s API endpoint on port 443
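A quick way to sanity-check the API key and the network path before touching Kindo is to call the provider's model-list endpoint from a machine on the same network as your cluster (shown here for OpenAI; the host and auth header differ per provider):

```bash
# Confirms the key is valid and the provider endpoint is reachable on port 443.
curl -s https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | head -c 300
```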
Adding a Cloud Model to Kindo
Use the model management API to register the model. The key fields:
| Field | What to set | Common mistake |
|---|---|---|
| `contextWindow` | The model’s full advertised context window | Setting it too low (e.g., 8,096 when the model supports 128,000) |
| `litellmModelName` | A unique name for routing | Using a name that conflicts with an existing model |
| `litellmParams.model` | Provider-prefixed model ID (see prefix table below) | Wrong prefix or outdated model version |
| `litellmParams.api_key` | Your provider API key | Key for wrong environment or expired key |
| `litellmParams.api_base` | Provider endpoint (only if non-default) | Setting this when it should be omitted for standard providers |
Provider prefix reference for `litellmParams.model`:
| Provider | Prefix | Example |
|---|---|---|
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/llama-4-scout |
For the full API call, see Add a Global Model.
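Putting the fields together, a registration payload for a cloud model looks roughly like this (a sketch only; the field names mirror the table above, and the exact request envelope and endpoint are documented in Add a Global Model):

```json
{
  "litellmModelName": "claude-sonnet-4",
  "contextWindow": 200000,
  "litellmParams": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "api_key": "<your-anthropic-api-key>"
  }
}
```

Note that `api_base` is omitted: per the field table, it is only set for non-default endpoints.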
Provider-Specific Notes
OpenAI / Azure OpenAI:
- Azure requires `api_base`, `api_version`, and `api_key` specific to your Azure deployment
- Model names differ between OpenAI and Azure (e.g., `gpt-4o` vs your Azure deployment name)
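For Azure specifically, the `litellmParams` sketch below shows the shape; the deployment name, resource endpoint, and API version are placeholders you must replace with your own values:

```json
{
  "litellmModelName": "gpt-4o-azure",
  "contextWindow": 128000,
  "litellmParams": {
    "model": "azure/<your-deployment-name>",
    "api_base": "https://<your-resource>.openai.azure.com",
    "api_version": "<your-api-version>",
    "api_key": "<your-azure-api-key>"
  }
}
```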
Anthropic:
- Use `anthropic/` prefix in `litellmParams.model` (e.g., `anthropic/claude-sonnet-4-20250514`)
- Supports up to 200k context window — set it accordingly
Google (Gemini):
- Use `gemini/` prefix (e.g., `gemini/gemini-2.5-pro`)
- Gemini 2.5 Pro supports 1M tokens — do not cap it artificially
AWS Bedrock:
- Use `bedrock/` prefix in `litellmParams.model` (e.g., `bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0`)
- Models must be explicitly enabled in the AWS Console under Bedrock → Model access before they can be used
- Use cross-region inference profiles for better availability (e.g., `us.anthropic.claude-*` instead of region-specific ARNs)
- IAM permissions required: `aws-marketplace:ViewSubscriptions` and `aws-marketplace:Subscribe` in addition to standard `bedrock:InvokeModel` permissions
- Ensure your AWS credentials (access key, secret key, region) are configured in `litellmParams` (see the sketch below)
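A Bedrock registration sketch, assuming LiteLLM's conventional AWS credential field names (verify the exact names against your Kindo version before use):

```json
{
  "litellmModelName": "claude-sonnet-4-bedrock",
  "contextWindow": 200000,
  "litellmParams": {
    "model": "bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0",
    "aws_access_key_id": "<access-key-id>",
    "aws_secret_access_key": "<secret-access-key>",
    "aws_region_name": "us-east-1"
  }
}
```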
Self-Hosted Models
Self-hosted models give you complete control over data residency and model selection, but they require careful configuration. Most deployment issues we see come from incorrect inference server configuration, not from Kindo itself.
Pre-Deployment Checklist
Before you start any self-hosted model deployment:
- Hardware verified — GPUs have enough VRAM for the model at your target context length (see GPU sizing below)
- NVIDIA drivers and container runtime installed — `nvidia-smi` shows your GPUs, `nvidia-container-runtime` is configured
- Model weights downloaded — access tokens set, weights pulled successfully to local storage
- Official model documentation read — you know the recommended context window, quantization, and any special flags
- Context window sized correctly — set to the model’s full supported context length (or the maximum your hardware can serve)
- Tool-call parser identified — if the model supports tool calling, you know which `--tool-call-parser` flag to use
- DNS/hostname planned — the inference endpoint will be reachable from your Kubernetes cluster
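Two of these items are quick to verify from the GPU host before going further (the model ID and target directory below are illustrative; `huggingface-cli` ships with the `huggingface_hub` package):

```bash
# List GPUs and total VRAM so you can compare against the sizing table below.
nvidia-smi --query-gpu=name,memory.total --format=csv

# Pre-pull model weights to local storage (gated models require an HF token).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir /models/llama-3.1-70b
```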
GPU Sizing
The VRAM required depends on the model size, quantization, and the context length you want to serve. Context length is the primary variable — a 70B model serving 4k context needs far less VRAM than the same model serving 128k context due to KV-cache memory.
| Model Size | Quantization | Short Context (4k) | Medium Context (32k) | Full Context (128k+) |
|---|---|---|---|---|
| 7–8B | FP16/BF16 | 1× 24GB GPU | 1× 24GB GPU | 1× 48GB GPU |
| 7–8B | FP8 | 1× 24GB GPU | 1× 24GB GPU | 1× 24GB GPU |
| 13B | FP16/BF16 | 1× 48GB GPU | 1× 80GB GPU | 1× 80GB GPU |
| 13B | FP8 | 1× 24GB GPU | 1× 48GB GPU | 1× 80GB GPU |
| 30–34B | FP16/BF16 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 30–34B | FP8 | 1× 48GB GPU | 1× 80GB GPU | 1–2× 80GB GPU |
| 70B | FP16/BF16 | 2× 80GB GPU | 4× 80GB GPU | 4–8× 80GB GPU |
| 70B | FP8 | 1× 80GB GPU | 2× 80GB GPU | 2–4× 80GB GPU |
| 70B | AWQ/GPTQ | 1× 80GB GPU | 2× 80GB GPU | 4× 80GB GPU |
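You can sanity-check the table with a simple estimate: weights plus KV cache. A back-of-envelope sketch for Llama 3.1 70B (80 layers, 8 KV heads, head dim 128, per its published config; approximate, and ignoring activation and framework overhead):

```bash
python3 - <<'EOF'
# KV cache bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × context_len × bytes/value
layers, kv_heads, head_dim, ctx, bytes_per = 80, 8, 128, 128_000, 2  # BF16 = 2 bytes
kv_gb = 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1e9
weights_gb = 70e9 * 2 / 1e9  # 70B params at BF16
print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.0f} GB per full-length sequence")
EOF
```

That works out to roughly 140 GB of weights plus ~42 GB of KV cache per full-length sequence, which is why full-context 70B serving lands in the 4–8 GPU row.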
vLLM Configuration
vLLM is the most common inference server for self-hosted models with Kindo. Getting the configuration right is critical.
Essential Flags
Consider these flags for every vLLM deployment:
```bash
vllm serve <model-id> \
  --served-model-name <name>          # Name Kindo will use to route requests
  --port <port>                       # Port to serve on
  --max-model-len <context-length>    # CRITICAL: Set to the model's full context window
  --dtype bfloat16                    # Use BF16 for modern GPUs (H100, A100, B200)
  --tensor-parallel-size <num-gpus>   # Number of GPUs for tensor parallelism
  --tool-call-parser <parser>         # CRITICAL for agents: enables tool/function calling
  --enable-auto-tool-choice           # Let vLLM automatically route tool calls
  --enable-prefix-caching             # Improves performance for repeated prompts
  --enable-chunked-prefill            # Better memory utilization for long contexts
```
The Two Flags That Cause the Most Issues
1. `--max-model-len` (Context Window)
This is the single most misconfigured setting. If you set this too low, Kindo will truncate conversations, agents will lose context mid-task, and workflows will fail on long documents.
| Model | Official Max Context | What we’ve seen customers set | Impact |
|---|---|---|---|
| Llama 4 Scout | 10,000,000 | 8,096 | Model can barely hold a single conversation turn |
| Llama 3.1 70B | 128,000 | 4,096 | Agents lose tool-call history after a few steps |
| DeepSeek-V3 | 128,000 | 8,192 | Workflow steps fail on any non-trivial input |
| Mistral Large | 128,000 | 32,000 | Works for chat, breaks on long agent tasks |
Rule of thumb: Set --max-model-len to the model’s full advertised context window, then reduce only if your hardware genuinely cannot support it. When reducing, start from the top and work down — don’t guess a low number.
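You can confirm what a running vLLM server actually loaded by querying its model list; recent vLLM builds report the effective limit in the model entry (the field name may vary by version):

```bash
# Look for "max_model_len" in the output; it should match what you passed at startup.
curl -s http://<your-inference-host>:<port>/v1/models | python3 -m json.tool
```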
2. `--tool-call-parser` (Tool/Function Calling)
Without this flag, models served through vLLM will not properly handle tool calls. Kindo agents depend on tool calling for every interaction with integrations, MCP servers, and multi-step workflows. If you skip this flag, agents will not work.
| Model Family | Parser Flag | Notes |
|---|---|---|
| Llama 4 (Scout, Maverick) | --tool-call-parser llama4 | New parser added in vLLM 0.8+ |
| Llama 3.1 / 3.3 | --tool-call-parser llama3_json | |
| Qwen 2.5 / 3 | --tool-call-parser hermes | Uses Hermes-style tool format |
| Qwen 3 Coder | --tool-call-parser qwen3_coder | Specific parser for coder variants |
| Mistral / Mixtral | --tool-call-parser mistral | |
| DeepSeek-V3 / R1 | --tool-call-parser hermes | Check vLLM docs for latest |
| Jamba | --tool-call-parser jamba |
Complete Example: Llama 4 Scout on 1× H100
```bash
# --max-model-len 1048576 = 1M tokens (reduced from the 10M full context to fit single-H100 VRAM)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --served-model-name llama-4-scout \
  --port 8000 \
  --max-model-len 1048576 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser llama4 \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Complete Example: DeepSeek-V3 on 8× H100
```bash
vllm serve deepseek-ai/DeepSeek-V3 \
  --served-model-name deepseek-v3 \
  --port 8000 \
  --max-model-len 128000 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Complete Example: Mistral Large on 2× H100
```bash
vllm serve mistralai/Mistral-Large-Instruct-2411 \
  --served-model-name mistral-large \
  --port 8000 \
  --max-model-len 128000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Model Family Quick Reference
Llama 4 (Scout, Maverick)
- Architecture: Mixture of Experts (MoE)
- Context: Up to 10M tokens (Scout), 1M tokens (Maverick)
- Tool parser: `llama4`
- Key notes: These are MoE models — the active parameter count is much lower than the total. Scout 17B-16E activates ~17B parameters per token despite having 109B total. Hardware requirements are based on total parameters (all experts must be in memory), but inference speed is based on active parameters.
- Gotcha: Do not set `--max-model-len 8096` on a model designed for 10M tokens. Start with at least 1M and reduce only if hardware requires it.
Llama 3.1 / 3.3
- Context: 128k tokens
- Tool parser: `llama3_json`
- Key notes: The 70B Instruct variant is an excellent general-purpose model. Requires `--tool-call-parser llama3_json` — not `llama3` (which doesn’t exist as a parser).
Mistral / Mixtral
- Context: 32k–128k tokens depending on variant
- Tool parser: `mistral`
- Key notes: Mistral Large supports 128k context. Mixtral 8x7B uses MoE architecture. Both work well with vLLM but require the `mistral` parser for tool calling.
DeepSeek-V3 / DeepSeek-R1
- Context: 128k tokens
- Tool parser: `hermes` (check latest vLLM docs)
- Key notes: DeepSeek-V3 is a large MoE model (671B total, ~37B active). Requires significant GPU memory (8× H100 recommended). R1 is the reasoning variant — if using for agentic workloads, ensure tool calling works with your vLLM version.
Qwen 2.5 / Qwen 3
- Context: 32k–128k tokens depending on variant
- Tool parser: `hermes` (Qwen 2.5), `qwen3_coder` (Qwen 3 Coder)
- Key notes: Excellent multilingual support. Qwen 2.5 72B is a strong alternative to Llama 3.1 70B (a launch sketch follows below).
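Since there is no complete example for Qwen above, here is a sketch in the same pattern. The GPU count and 32k context follow the sizing table for a 70B-class BF16 model and are assumptions to adjust for your hardware and the official Qwen model card:

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --served-model-name qwen-2.5-72b \
  --port 8000 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```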
Verifying Your Model Before Connecting to Kindo
Always test your model deployment independently before adding it to Kindo. This isolates inference issues from Kindo configuration issues.
Step 1: Health Check
Verify the inference server is running and the model is loaded:
```bash
curl http://<your-inference-host>:<port>/v1/models
```
Expected response: a JSON object listing your served model name. If this fails, the model is not ready — do not proceed.
Step 2: Basic Completion Test
```bash
curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [
      {"role": "user", "content": "What is 2 + 2? Reply with just the number."}
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```
You should get a coherent response with `"finish_reason": "stop"`. If the response is garbled, the model may be misconfigured (wrong dtype, corrupted weights, or insufficient VRAM).
Step 3: Tool Calling Test
This is the test most people skip — and the one that catches the most issues:
```bash
curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 256
  }'
```
Expected: The response should contain a `tool_calls` array with a call to `get_weather` and `"location": "San Francisco"` (or similar). If you get a plain text response instead of a tool call, your `--tool-call-parser` flag is missing or wrong.
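For reference, a passing response looks roughly like this (IDs and argument formatting vary by model and vLLM version):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```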
Step 4: Context Length Test
Verify the model can actually handle the context length you configured:
```bash
# Generate a prompt that approaches your configured context length.
# For a 128k context model, try sending ~100k tokens of input.
python3 -c "
import json
# ~4 chars per token, target ~100k tokens
padding = 'The quick brown fox jumps over the lazy dog. ' * 11000
msg = {'model': '<your-served-model-name>',
       'messages': [{'role': 'user', 'content': f'{padding}\n\nSummarize the above text in one sentence.'}],
       'max_tokens': 64}
print(json.dumps(msg))
" | curl -X POST http://<your-inference-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @-
```
If this returns a context-length error, your `--max-model-len` is too low or your GPU lacks sufficient memory. Adjust accordingly.
Connecting to Kindo
Once your model passes all verification steps:
- Add the model to Kindo using the model management API
- Set the context window to match your verified deployment (the value you set in `--max-model-len`)
- Configure Unleash feature variants to use the new model ID where appropriate (see Unleash Features)
- Enable multimodal support (if applicable) — if the model supports image or file input (e.g., vision models), add its model ID to the `MULTIMODAL_MODELS` Unleash flag so Kindo surfaces file/image upload capabilities in the UI
- Test end-to-end in Kindo — send a chat message, run an agent task, verify tool calling works
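For a self-hosted vLLM model, the registration payload uses the `openai/` prefix and points `api_base` at your inference endpoint. A sketch for the Llama 4 Scout deployment above (whether an `api_key` is needed depends on your server's auth setup):

```json
{
  "litellmModelName": "llama-4-scout",
  "contextWindow": 1048576,
  "litellmParams": {
    "model": "openai/llama-4-scout",
    "api_base": "http://<your-inference-host>:8000/v1",
    "api_key": "<token-if-your-server-requires-one>"
  }
}
```

Note that `litellmModelName` matches the `--served-model-name` from the vLLM command, and `contextWindow` matches `--max-model-len`.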
Networking and DNS
Model inference endpoints must be reachable from the Kindo Kubernetes cluster. This sounds obvious, but DNS resolution issues are the third most common deployment problem we see.
Checklist
- Inference endpoint has a stable hostname or IP — not a localhost address
- DNS resolves from inside the cluster — test with `kubectl run -it --rm dns-test --image=busybox -- nslookup <your-inference-host>`
- Port is accessible — test with `kubectl run -it --rm net-test --image=busybox -- wget -qO- http://<your-inference-host>:<port>/v1/models`
- No firewall blocking traffic between the Kubernetes cluster and the inference endpoint
- TLS configured if required — if using HTTPS, ensure the certificate is trusted by the cluster (or use `api_base` with the correct scheme)
Common DNS Issues
| Symptom | Likely cause | Fix |
|---|---|---|
| `Connection refused` from Kindo | Inference server not running or wrong port | Verify the server is up: `curl <host>:<port>/v1/models` from outside the cluster |
| `Name resolution failed` | DNS not configured for the inference hostname | Add DNS entries or use IP addresses; verify with `nslookup` from inside a pod |
| Works from your laptop, fails from Kindo | Split DNS or firewall rules | Ensure the Kubernetes nodes can reach the inference endpoint, not just your workstation |
| Intermittent timeouts | Inference server overloaded or network instability | Check GPU utilization, consider scaling replicas or reducing concurrent requests |
Common Pitfalls
A summary of the issues that catch most self-hosted deployments:
| Pitfall | Impact | Prevention |
|---|---|---|
| Context window too small | Conversations truncated, agents lose context, workflows fail | Set --max-model-len to the model’s full advertised context window |
| Missing tool-call parser | Agents cannot use tools, integrations broken | Always set --tool-call-parser and --enable-auto-tool-choice |
| Wrong served-model-name | Kindo can’t route requests to the model | Ensure --served-model-name matches litellmModelName in Kindo |
| DNS not reachable from cluster | Kindo returns connection errors | Test DNS resolution and connectivity from inside a Kubernetes pod |
| Skipping model verification | Problems blamed on Kindo that are actually inference issues | Always run the verification steps before connecting |
| Outdated vLLM version | Missing parser support, known bugs | Use vLLM 0.8+ for Llama 4 support, 0.6+ for most other models |
| Not reading official model docs | Wrong dtype, missing flags, unsupported features | Always start with the model provider’s official documentation |