# Prepare AI Models
Self-Managed Kindo does not ship AI models — you provide the inference infrastructure. That means picking cloud API providers (OpenAI, Anthropic, Gemini, Bedrock, Azure), self-hosting open-weight models with vLLM, or mixing the two. This page shows you the moving parts and — most importantly for the CLI flow — how those models plug into install-contract.yaml under models: so kindo install --apply can register them with the platform.
## What the CLI does for you
During the post-install step, kindo-cli reads the models: array from your install-contract.yaml and registers each entry with the Kindo admin API. After this step:
- Models exist in the `Model` database table with IDs assigned.
- Unleash feature flag variants (`EMBEDDING_MODELS`, `AUDIO_TRANSCRIPTION`, `TOOL_CALLING_MODELS`, `DEFAULT_WORKFLOW_STEP_MODEL`, etc.) are updated to point at those model IDs.
- LiteLLM is configured to route traffic to the correct endpoint for each model name.
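For orientation, each contract entry ultimately becomes a LiteLLM route. A minimal sketch of what a generated route could look like, using LiteLLM's public `model_list` proxy-config shape (illustrative only; the CLI's actual output may differ):

```yaml
# Illustrative sketch — the CLI generates this for you.
# Field names follow LiteLLM's proxy config format, not a Kindo-specific schema.
model_list:
  - model_name: claude-sonnet-4-20250514          # your litellmModelName
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514   # your litellmModel
      api_key: os.environ/ANTHROPIC_API_KEY
```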
Your job on this page is to:
- Decide which providers/endpoints you will use.
- Make sure each endpoint is reachable from the cluster.
- Describe them in `models:` using the contract shape below.
## The `models:` contract shape
models: is a list of dicts. Every entry uses the same top-level keys regardless of provider — only the credential fields change. The CLI writes this during kindo config init; you can also edit by hand and re-validate.
```yaml
models:
  - litellmModelName: claude-sonnet-4-20250514        # unique routing key, matches Unleash variants
    litellmModel: anthropic/claude-sonnet-4-20250514  # LiteLLM model ID with provider prefix
    displayName: Claude Sonnet 4                      # shown in the UI model picker
    provider: Anthropic                               # ModelProvider display name (must match exactly)
    creator: Anthropic
    type: CHAT                                        # CHAT | INTERNAL
    metadataType: Text Generation                     # Text Generation | Embeddings | Transcription
    costTier: MEDIUM                                  # LOW | MEDIUM | HIGH
    contextWindow: 200000                             # the model's full advertised context window
    maxTokens: 16384                                  # max output tokens
    apiKey: ${ANTHROPIC_API_KEY}                      # or inline secret, or env var
    description: Strong general-purpose model for chat and agents
    docLink: https://docs.anthropic.com/
```

### Field reference
| Field | Required | What it is |
|---|---|---|
| `litellmModelName` | yes | Unique routing key. Must match `--served-model-name` for vLLM and the names referenced by Unleash variants. |
| `litellmModel` | yes | Provider-prefixed LiteLLM model ID (see prefix table below). |
| `displayName` | yes | Human-readable name shown in the UI. |
| `provider` | yes | Maps to `modelProviderDisplayName` — must match an existing `ModelProvider` row or a new one is created. |
| `creator` | recommended | Model author (Anthropic, OpenAI, NVIDIA, Meta, etc.). |
| `type` | yes | `CHAT` for user-facing LLMs, `INTERNAL` for embeddings/transcription. |
| `metadataType` | yes | `Text Generation`, `Embeddings`, or `Transcription`. Drives Unleash flag routing. |
| `costTier` | recommended | `LOW`, `MEDIUM`, or `HIGH`. |
| `contextWindow` | yes | Set to the model's full advertised context window — do not low-ball it. |
| `maxTokens` | recommended | Maximum output tokens LiteLLM will request. |
| `description` / `docLink` | optional | Surfaced in the model picker. |
| Credentials | varies | See per-provider sections below (`apiKey`, `apiBase`, `apiVersion`, `awsRegion`, `awsAccessKeyId`, `awsSecretAccessKey`, `baseModel`). |
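The required/optional split above is easy to check mechanically before you run the installer. A minimal sketch in Python (a hypothetical helper, not part of kindo-cli) that flags missing required fields in a `models:` entry:

```python
# Required keys per the field reference table above (illustrative check only).
REQUIRED = {"litellmModelName", "litellmModel", "displayName",
            "provider", "type", "metadataType", "contextWindow"}

def missing_fields(entry: dict) -> set:
    """Return the required contract fields absent from one models: entry."""
    return REQUIRED - entry.keys()

entry = {
    "litellmModelName": "gpt-4o",
    "litellmModel": "openai/gpt-4o",
    "displayName": "GPT-4o",
    "provider": "OpenAI",
    "type": "CHAT",
    "metadataType": "Text Generation",
}
print(missing_fields(entry))  # → {'contextWindow'}
```

Running something like this over every entry before `kindo install --apply` turns a post-install registration failure into a pre-install typo fix.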
## Provider prefix reference
| Provider | litellmModel prefix | Example |
|---|---|---|
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/deephat-v2 |
## Cloud provider playbooks
Cloud models are the simplest path — an API key, the right prefix, and network egress. No GPUs, no inference tuning.
### OpenAI

```yaml
models:
  - litellmModelName: gpt-4o
    litellmModel: openai/gpt-4o
    displayName: GPT-4o
    provider: OpenAI
    creator: OpenAI
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiKey: sk-proj-...
```

- Use the exact model ID OpenAI publishes (e.g. `gpt-4o`, `gpt-4o-mini`). Outdated IDs silently fall back or fail.
- Omit `apiBase` — LiteLLM routes to `https://api.openai.com/v1` by default.
- An organization ID can be supplied by setting `OPENAI_ORG_ID` in the LiteLLM environment if needed.
### Anthropic

```yaml
models:
  - litellmModelName: claude-sonnet-4-20250514
    litellmModel: anthropic/claude-sonnet-4-20250514
    displayName: Claude Sonnet 4
    provider: Anthropic
    creator: Anthropic
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 200000
    maxTokens: 16384
    apiKey: sk-ant-...
```

- Use the dated model ID (e.g. `claude-sonnet-4-20250514`), not an alias, for reproducible deployments.
### Google Gemini

```yaml
models:
  - litellmModelName: gemini-2.5-pro
    litellmModel: gemini/gemini-2.5-pro
    displayName: Gemini 2.5 Pro
    provider: Google
    creator: Google
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 1000000
    maxTokens: 65536
    apiKey: AIza...
```

### AWS Bedrock

```yaml
models:
  - litellmModelName: claude-sonnet-4-bedrock
    litellmModel: bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0
    displayName: Claude Sonnet 4 (Bedrock)
    provider: AWS Bedrock
    creator: Anthropic
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 200000
    maxTokens: 16384
    awsRegion: us-west-2
    awsAccessKeyId: AKIA...
    awsSecretAccessKey: ...
```

- Enable the model in the AWS Console under Bedrock → Model access before use.
- Use cross-region inference profiles (e.g. `us.anthropic.claude-*`) instead of region-specific ARNs for better availability.
- IAM requires `bedrock:InvokeModel` plus `aws-marketplace:ViewSubscriptions` and `aws-marketplace:Subscribe`.
- If you leave `awsAccessKeyId`/`awsSecretAccessKey` empty, LiteLLM falls back to the IRSA role on the cluster — only do this if your service account has Bedrock permissions.
### Azure OpenAI

```yaml
models:
  - litellmModelName: azure-gpt-4o
    litellmModel: azure/my-gpt4o-deployment
    displayName: GPT-4o (Azure)
    provider: Azure OpenAI
    creator: OpenAI
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiBase: https://my-resource.openai.azure.com
    apiVersion: '2024-10-21'
    apiKey: ...
    baseModel: gpt-4o   # optional: LiteLLM model_info.base_model for pricing
```

- `litellmModel` uses your Azure deployment name, not the OpenAI model name.
- `apiBase`, `apiVersion`, and `apiKey` are all required — Azure will not route without them.
- Set `baseModel` to the underlying OpenAI model (`gpt-4o`, `gpt-4o-mini`, etc.) so LiteLLM can report correct pricing and capabilities.
## Self-hosted vLLM patterns
Self-hosting gets you full data-residency control and unlimited request volume, but you own the inference server. vLLM is what we recommend, and what this section covers.
### Pre-deployment checklist
- GPUs sized for your target context length (table below).
- NVIDIA drivers + `nvidia-container-runtime` installed — `nvidia-smi` shows your GPUs.
- Model weights pulled (HuggingFace token exported: `export HF_TOKEN=...`).
- Official model card read — context window, max output tokens, dtype, and any special flags.
- Correct `--tool-call-parser` identified for your model family.
- DNS/hostname planned — the endpoint will be reachable from the cluster.
### GPU sizing
Context length is the primary VRAM driver via the KV cache. A 70B model at 4k context needs far less VRAM than the same model at 128k.
| Model size | Quantization | Short (4k) | Medium (32k) | Full (128k+) |
|---|---|---|---|---|
| 7–8B | FP16/BF16 | 1× 24GB | 1× 24GB | 1× 48GB |
| 7–8B | FP8 | 1× 24GB | 1× 24GB | 1× 24GB |
| 13B | FP16/BF16 | 1× 48GB | 1× 80GB | 1× 80GB |
| 13B | FP8 | 1× 24GB | 1× 48GB | 1× 80GB |
| 30–34B | FP16/BF16 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 30–34B | FP8 | 1× 48GB | 1× 80GB | 1–2× 80GB |
| 70B | FP16/BF16 | 2× 80GB | 4× 80GB | 4–8× 80GB |
| 70B | FP8 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 70B | AWQ/GPTQ | 1× 80GB | 2× 80GB | 4× 80GB |
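The KV-cache arithmetic behind this table can be sketched directly. The numbers below assume a Llama-3-70B-style geometry (80 layers, 8 KV heads via grouped-query attention, head dimension 128) — illustrative values, so check your model's config file before sizing hardware:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_el: int = 2) -> int:
    # Keys and values are each tokens * kv_heads * head_dim elements per layer,
    # hence the leading factor of 2. bytes_per_el=2 corresponds to FP16/BF16.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_el

GIB = 2**30
print(kv_cache_bytes(80, 8, 128, 4_096) / GIB)    # 1.25  (GiB of KV cache at 4k context)
print(kv_cache_bytes(80, 8, 128, 131_072) / GIB)  # 40.0  (GiB of KV cache at 128k context)
```

The weights themselves (~140 GB for a 70B model in BF16) sit on top of this, which is why the full-context column jumps to 4–8× 80GB cards.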
### Essential vLLM flags
```shell
# --served-model-name: must match litellmModelName in the contract
# --max-model-len: the model's full context window
# --tool-call-parser + --enable-auto-tool-choice: required for agents
vllm serve <model-id> \
  --served-model-name <name> \
  --port 8000 \
  --max-model-len <context-length> \
  --dtype bfloat16 \
  --tensor-parallel-size <num-gpus> \
  --tool-call-parser <parser> \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

Two flags cause the majority of issues:

- `--max-model-len` — set to the model's full context. If vLLM OOMs on startup, it will report the maximum your hardware can support; use that number, don't guess a lower one.
- `--tool-call-parser` — without it, Kindo agents cannot call tools. Always pair with `--enable-auto-tool-choice`.
### Tool-call parser reference
| Model family | --tool-call-parser | Notes |
|---|---|---|
| NVIDIA Nemotron 3 Super | qwen3_coder | Also set --reasoning-parser nemotron_v3. Requires --trust-remote-code. |
| Mistral / Mixtral | mistral | |
| DeepSeek-V3 / R1 | hermes | Verify against latest vLLM release. |
| Qwen 3 / Qwen 3 Coder | qwen3_coder | |
| DeepHat V2 | qwen3_coder | See DeepHat section below. |
| GPT OSS | openai | Also set --reasoning-parser openai_gptoss. |
### `--served-model-name` and the contract
--served-model-name is the string vLLM uses for the model field in its OpenAI-compatible API. Kindo routes requests via LiteLLM, which resolves a litellmModelName to the litellmModel (e.g. openai/my-name), then strips the openai/ prefix and sends "model": "my-name" to your inference server.
The rule: --served-model-name in vLLM must match the suffix of litellmModel and the litellmModelName in your contract.
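The resolution chain can be illustrated in a few lines of Python — a sketch of the rule, not LiteLLM's actual implementation:

```python
def upstream_model_field(litellm_model: str) -> str:
    """Sketch: what ends up in the "model" field sent to the inference server.

    For openai/-prefixed entries (custom OpenAI-compatible endpoints like vLLM),
    the prefix is stripped; the remainder must equal --served-model-name.
    """
    prefix, sep, rest = litellm_model.partition("/")
    return rest if sep and prefix == "openai" else litellm_model

assert upstream_model_field("openai/deephat-v2") == "deephat-v2"
```

If vLLM was started with `--served-model-name deephat-v2`, the stripped name matches and the request routes; any mismatch surfaces as a 404 from the inference server.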
```shell
vllm serve Qwen/Qwen3-Coder-30B-Instruct \
  --served-model-name qwen3-coder-30b \
  ...
```

```yaml
models:
  - litellmModelName: qwen3-coder-30b    # same as --served-model-name
    litellmModel: openai/qwen3-coder-30b # same suffix, openai/ prefix
    apiBase: http://vllm-qwen.inference.svc.cluster.local:8000/v1
    ...
```

### Full example: NVIDIA Nemotron 3 Super on 4× H100
```shell
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --served-model-name nemotron \
  --port 8000 \
  --max-model-len 1000000 \
  --kv-cache-dtype fp8 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --reasoning-parser nemotron_v3
```

## DeepHat
DeepHat V2 is Kindo’s cybersecurity-focused model built for offensive reasoning, long-context analysis, and secure execution. Treat it like any other self-hosted vLLM model in the contract — the specifics below just call out its hardware and flag requirements.
### Hardware requirements
DeepHat V2 serves its full 250k-token context on:
- 1× B200 GPU, or
- 2× H100 GPUs.
A single H100 can serve at most ~90,000 tokens due to KV-cache memory.
### Dependencies
- `vllm>=0.14.1` (brings all implicit dependencies).
- A HuggingFace access token provisioned by Kindo: `export HF_TOKEN=<token>`.
### Option A: Standalone vLLM
2× H100:

```shell
vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port 8000 \
  --max-model-len 250000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice
```

1× B200:

```shell
vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port 8000 \
  --max-model-len 250000 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice
```

### Option B: Optional Helm chart
Skip this if you already have LLM serving infrastructure — run vLLM wherever you want and point Kindo at it. If you want a turnkey in-cluster deployment, Kindo ships a Helm chart.
`values-deephat.yaml` for 2× H100:

```yaml
model: DeepHat/DeepHat-V2-ext
servedModelName: deephat-v2
maxModelLen: 250000
dtype: bfloat16
tensorParallelSize: '2'
enableChunkedPrefill: 'true'
enablePrefixCaching: 'true'
enableAutoToolChoice: 'true'
toolCallParser: 'qwen3_coder'

resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2

hfToken: '<your_huggingface_token>'
vllmApiKey: '<your_api_key>'
```

For 1× B200, set `tensorParallelSize: '1'` and GPU resources to 1.
Deploy:

```shell
helm install deephat-v2 . -f values-deephat.yaml
```

### DeepHat in the contract
```yaml
models:
  - litellmModelName: deephat-v2
    litellmModel: openai/deephat-v2
    displayName: DeepHat V2
    provider: Self-Hosted
    creator: Kindo
    type: CHAT
    metadataType: Text Generation
    costTier: LOW
    contextWindow: 250000
    maxTokens: 16384
    apiBase: http://deephat-v2.inference.svc.cluster.local:8000/v1
    apiKey: ${VLLM_API_KEY}   # whatever you set as vllmApiKey above
    description: Kindo's cybersecurity-focused model with 250k context
```

## Verify: Is the endpoint reachable from the cluster?
Always verify the endpoint from inside the cluster before running kindo install --apply. A local curl from your laptop is not sufficient — split DNS and firewall rules routinely make endpoints that work locally unreachable from pods.
### Step 1: DNS from a pod
```shell
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 \
  -- nslookup <your-inference-host>
```

### Step 2: HTTP `/v1/models` from a pod
For a cluster-internal vLLM service:
```shell
kubectl run net-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  -- curl -sS -m 5 http://<your-inference-host>:<port>/v1/models
```

For a cloud provider, use the provider's host with your key. Example — Anthropic:
```shell
kubectl run net-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  --env ANTHROPIC_API_KEY=sk-ant-... \
  -- sh -c 'curl -sS -m 10 https://api.anthropic.com/v1/models \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01"'
```

Expected: a JSON response listing the model(s). If this fails, fix connectivity before proceeding — kindo install --apply will fail the same way.
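For a self-hosted vLLM endpoint, a healthy reply looks roughly like this (abbreviated sketch; vLLM follows the OpenAI-compatible list format, and the `id` should equal your `--served-model-name`):

```json
{
  "object": "list",
  "data": [
    { "id": "deephat-v2", "object": "model" }
  ]
}
```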
### Step 3 (self-hosted only): tool-call smoke test
```shell
kubectl run tool-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  -- curl -sS -m 30 \
    -X POST http://<your-inference-host>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "<your-served-model-name>",
      "messages": [{"role": "user", "content": "What is the weather in SF?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
          }
        }
      }],
      "tool_choice": "auto",
      "max_tokens": 256
    }'
```

Expected: the response contains a `tool_calls` array with `"name": "get_weather"` and `"location": "San Francisco"` (or similar). A plain-text response means your `--tool-call-parser` is missing or wrong — agents will not work.
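A successful response carries the call in `tool_calls` rather than in the message text. An abbreviated sketch of the OpenAI-compatible shape to look for (fields trimmed for readability):

```json
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      }
    }
  ]
}
```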
## Common pitfalls
| Pitfall | Impact | Prevention |
|---|---|---|
| `contextWindow` too small | Conversations truncated, agents lose context | Use the model's full advertised window |
| Missing `--tool-call-parser` | Agents cannot call tools | Always pair with `--enable-auto-tool-choice` |
| `--served-model-name` ≠ `litellmModelName` | LiteLLM returns 404 | Keep both strings identical |
| DNS not reachable from cluster | `kindo install --apply` fails at post-install | Verify with `kubectl run ... nslookup` first |
| Bedrock model not enabled in AWS | 403 from LiteLLM | Enable via Bedrock → Model access before install |
| Azure missing `apiVersion` | LiteLLM rejects requests | `apiBase`, `apiVersion`, and `apiKey` are all required |
| Skipped verification | Kindo blamed for inference bugs | Run all three verification steps before install |
## Next
Once models: is filled in and each endpoint passes the kubectl run checks, you’re ready to install.