Prepare AI Models

Self-Managed Kindo does not ship AI models — you provide the inference infrastructure. That means picking cloud API providers (OpenAI, Anthropic, Gemini, Bedrock, Azure), self-hosting open-weight models with vLLM, or mixing the two. This page shows you the moving parts and — most importantly for the CLI flow — how those models plug into install-contract.yaml under models: so kindo install --apply can register them with the platform.

What the CLI does for you

During the post-install step, kindo-cli reads the models: array from your install-contract.yaml and registers each entry with the Kindo admin API. After this step:

  1. Models exist in the Model database table with IDs assigned.
  2. Unleash feature flag variants (EMBEDDING_MODELS, AUDIO_TRANSCRIPTION, TOOL_CALLING_MODELS, DEFAULT_WORKFLOW_STEP_MODEL, etc.) are updated to point at those model IDs.
  3. LiteLLM is configured to route traffic to the correct endpoint for each model name.

Your job on this page is to:

  1. Decide which providers/endpoints you will use.
  2. Make sure each endpoint is reachable from the cluster.
  3. Describe them in models: using the contract shape below.

The models: contract shape

models: is a list of dicts. Every entry uses the same top-level keys regardless of provider — only the credential fields change. The CLI writes this during kindo config init; you can also edit by hand and re-validate.

install-contract.yaml
models:
  - litellmModelName: claude-sonnet-4-20250514 # unique routing key, matches Unleash variants
    litellmModel: anthropic/claude-sonnet-4-20250514 # LiteLLM model ID with provider prefix
    displayName: Claude Sonnet 4 # shown in the UI model picker
    provider: Anthropic # ModelProvider display name (must match exactly)
    creator: Anthropic
    type: CHAT # CHAT | INTERNAL
    metadataType: Text Generation # Text Generation | Embeddings | Transcription
    costTier: MEDIUM # LOW | MEDIUM | HIGH
    contextWindow: 200000 # the model's full advertised context window
    maxTokens: 16384 # max output tokens
    apiKey: ${ANTHROPIC_API_KEY} # or inline secret, or env var
    description: Strong general-purpose model for chat and agents
    docLink: https://docs.anthropic.com/
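Before handing the file to the CLI, you can sanity-check an entry's shape yourself. A minimal sketch in Python — the required-field set mirrors the field reference below, but this helper is illustrative and not part of kindo-cli:

```python
# Illustrative pre-flight check for a models: entry. The REQUIRED set and the
# allowed enum values come from the contract's field reference; kindo-cli does
# its own validation, so treat this as a quick local sanity check only.
REQUIRED = {
    "litellmModelName", "litellmModel", "displayName",
    "provider", "type", "metadataType", "contextWindow",
}
ALLOWED_TYPES = {"CHAT", "INTERNAL"}
ALLOWED_METADATA = {"Text Generation", "Embeddings", "Transcription"}

def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the entry looks sane."""
    problems = [f"missing required field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("type") not in ALLOWED_TYPES:
        problems.append(f"type must be one of {sorted(ALLOWED_TYPES)}")
    if entry.get("metadataType") not in ALLOWED_METADATA:
        problems.append(f"metadataType must be one of {sorted(ALLOWED_METADATA)}")
    return problems

entry = {
    "litellmModelName": "claude-sonnet-4-20250514",
    "litellmModel": "anthropic/claude-sonnet-4-20250514",
    "displayName": "Claude Sonnet 4",
    "provider": "Anthropic",
    "type": "CHAT",
    "metadataType": "Text Generation",
    "contextWindow": 200000,
}
print(check_entry(entry))  # → []
```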

Field reference

| Field | Required | What it is |
|---|---|---|
| litellmModelName | yes | Unique routing key. Must match --served-model-name for vLLM and the names referenced by Unleash variants. |
| litellmModel | yes | Provider-prefixed LiteLLM model ID (see prefix table below). |
| displayName | yes | Human-readable name shown in the UI. |
| provider | yes | Maps to modelProviderDisplayName — must match an existing ModelProvider row or a new one is created. |
| creator | recommended | Model author (Anthropic, OpenAI, NVIDIA, Meta, etc.). |
| type | yes | CHAT for user-facing LLMs, INTERNAL for embeddings/transcription. |
| metadataType | yes | Text Generation, Embeddings, or Transcription. Drives Unleash flag routing. |
| costTier | recommended | LOW, MEDIUM, or HIGH. |
| contextWindow | yes | Set to the model's full advertised context window — do not low-ball it. |
| maxTokens | recommended | Maximum output tokens LiteLLM will request. |
| description / docLink | optional | Surfaced in the model picker. |
| Credentials | varies | See per-provider sections below (apiKey, apiBase, apiVersion, awsRegion, awsAccessKeyId, awsSecretAccessKey, baseModel). |

Provider prefix reference

| Provider | litellmModel prefix | Example |
|---|---|---|
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/deephat-v2 |
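The same contract shape covers Bedrock. A hedged example entry combining the prefix above with the AWS credential fields from the field reference — the provider display name, region, cost tier, and token limits here are illustrative, and the model must already be enabled under Bedrock → Model access:

```yaml
models:
  - litellmModelName: claude-sonnet-4-bedrock
    litellmModel: bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0
    displayName: Claude Sonnet 4 (Bedrock)   # illustrative
    provider: AWS Bedrock                    # assumed ModelProvider display name
    creator: Anthropic
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 200000
    maxTokens: 16384
    awsRegion: us-east-1                     # illustrative region
    awsAccessKeyId: ${AWS_ACCESS_KEY_ID}
    awsSecretAccessKey: ${AWS_SECRET_ACCESS_KEY}
```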

Cloud provider playbooks

Cloud models are the simplest path — an API key, the right prefix, and network egress. No GPUs, no inference tuning.

models:
  - litellmModelName: gpt-4o
    litellmModel: openai/gpt-4o
    displayName: GPT-4o
    provider: OpenAI
    creator: OpenAI
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiKey: sk-proj-...
  • Use the exact model ID OpenAI publishes (e.g. gpt-4o, gpt-4o-mini). Outdated IDs silently fall back or fail.
  • Omit apiBase — LiteLLM routes to https://api.openai.com/v1 by default.
  • Organization ID can be supplied by setting OPENAI_ORG_ID in the LiteLLM environment if needed.
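Azure OpenAI follows the same pattern but routes by deployment name rather than model ID, and — as the pitfalls table notes — needs apiBase, apiVersion, and apiKey all set. An illustrative entry; the resource name, deployment name, and API version are placeholders you must replace with your own:

```yaml
models:
  - litellmModelName: gpt-4o-azure
    litellmModel: azure/<your-deployment-name>  # deployment name, not model ID
    displayName: GPT-4o (Azure)                 # illustrative
    provider: Azure OpenAI                      # assumed ModelProvider display name
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiBase: https://<your-resource>.openai.azure.com
    apiVersion: 2024-06-01                      # use the API version your resource supports
    apiKey: ${AZURE_OPENAI_API_KEY}
```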

Self-hosted vLLM patterns

Self-hosting gets you full data-residency control and unlimited request volume, but you own the inference server. vLLM is what we recommend, and what this section covers.

Pre-deployment checklist

  • GPUs sized for your target context length (table below).
  • NVIDIA drivers + nvidia-container-runtime installed — nvidia-smi shows your GPUs.
  • Model weights pulled (HuggingFace token exported: export HF_TOKEN=...).
  • Official model card read — context window, max output tokens, dtype, and any special flags.
  • Correct --tool-call-parser identified for your model family.
  • DNS/hostname planned — the endpoint will be reachable from the cluster.

GPU sizing

Context length is the primary VRAM driver via the KV cache. A 70B model at 4k context needs far less VRAM than the same model at 128k.

| Model size | Quantization | Short (4k) | Medium (32k) | Full (128k+) |
|---|---|---|---|---|
| 7–8B | FP16/BF16 | 1× 24GB | 1× 24GB | 1× 48GB |
| 7–8B | FP8 | 1× 24GB | 1× 24GB | 1× 24GB |
| 13B | FP16/BF16 | 1× 48GB | 1× 80GB | 1× 80GB |
| 13B | FP8 | 1× 24GB | 1× 48GB | 1× 80GB |
| 30–34B | FP16/BF16 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 30–34B | FP8 | 1× 48GB | 1× 80GB | 1–2× 80GB |
| 70B | FP16/BF16 | 2× 80GB | 4× 80GB | 4–8× 80GB |
| 70B | FP8 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 70B | AWQ/GPTQ | 1× 80GB | 2× 80GB | 4× 80GB |
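To see why the table scales so steeply with context length, here is a back-of-envelope KV-cache estimate. The architecture constants are assumptions for a typical 70B grouped-query-attention model (80 layers, 8 KV heads, head dim 128) — check your model's own config before trusting the numbers:

```python
# Rough KV-cache size per sequence: 2 (K and V) x layers x kv_heads x head_dim
# x bytes-per-element x context length. Constants below are assumed, not
# universal -- read them from your model's config.json.
def kv_cache_gib(context_len, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return context_len * per_token_bytes / 2**30

print(kv_cache_gib(4_096))    # → 1.25 (GiB at 4k context)
print(kv_cache_gib(131_072))  # → 40.0 (GiB at 128k context)
```

So moving a 70B FP16 deployment from 4k to 128k context adds roughly 40 GiB of KV cache per concurrent sequence on top of the model weights, which is why the rightmost column needs several extra 80GB GPUs.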

Essential vLLM flags

Terminal window
# --served-model-name must match litellmModelName in the contract
# --max-model-len is the model's full context window
# --tool-call-parser and --enable-auto-tool-choice are required for agents
vllm serve <model-id> \
--served-model-name <name> \
--port 8000 \
--max-model-len <context-length> \
--dtype bfloat16 \
--tensor-parallel-size <num-gpus> \
--tool-call-parser <parser> \
--enable-auto-tool-choice \
--enable-prefix-caching \
--enable-chunked-prefill

Two flags cause the majority of issues:

  1. --max-model-len — set to the model’s full context. If vLLM OOMs on startup, it will report the maximum your hardware can support; use that number, don’t guess a lower one.
  2. --tool-call-parser — without it, Kindo agents cannot call tools. Always pair with --enable-auto-tool-choice.

Tool-call parser reference

| Model family | --tool-call-parser | Notes |
|---|---|---|
| NVIDIA Nemotron 3 Super | qwen3_coder | Also set --reasoning-parser nemotron_v3. Requires --trust-remote-code. |
| Mistral / Mixtral | mistral | |
| DeepSeek-V3 / R1 | hermes | Verify against latest vLLM release. |
| Qwen 3 / Qwen 3 Coder | qwen3_coder | |
| DeepHat V2 | qwen3_coder | See DeepHat section below. |
| GPT OSS | openai | Also set --reasoning-parser openai_gptoss. |

--served-model-name and the contract

--served-model-name is the string vLLM uses for the model field in its OpenAI-compatible API. Kindo routes requests via LiteLLM, which resolves a litellmModelName to the litellmModel (e.g. openai/my-name), then strips the openai/ prefix and sends "model": "my-name" to your inference server.

The rule: --served-model-name in vLLM must match the suffix of litellmModel and the litellmModelName in your contract.

Terminal window
vllm serve Qwen/Qwen3-Coder-30B-Instruct \
--served-model-name qwen3-coder-30b \
...
models:
  - litellmModelName: qwen3-coder-30b # same as --served-model-name
    litellmModel: openai/qwen3-coder-30b # same suffix, openai/ prefix
    apiBase: http://vllm-qwen.inference.svc.cluster.local:8000/v1
    ...
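The rule is mechanical enough to check in a few lines. An illustrative sketch — not part of kindo-cli — that verifies the three strings line up the way LiteLLM expects:

```python
# Checks the naming rule: --served-model-name must equal litellmModelName,
# and litellmModel must be that same name behind an "openai/" prefix
# (self-hosted vLLM endpoints are routed through LiteLLM's OpenAI-compatible path).
def names_consistent(served_model_name: str, litellm_model_name: str, litellm_model: str) -> bool:
    prefix, _, suffix = litellm_model.partition("/")
    return (prefix == "openai"
            and suffix == served_model_name
            and litellm_model_name == served_model_name)

print(names_consistent("qwen3-coder-30b", "qwen3-coder-30b", "openai/qwen3-coder-30b"))  # → True
print(names_consistent("qwen3-coder-30b", "qwen3-coder-30b", "openai/qwen3-coder"))      # → False
```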

Full example: NVIDIA Nemotron 3 Super on 4× H100

Terminal window
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--served-model-name nemotron \
--port 8000 \
--max-model-len 1000000 \
--kv-cache-dtype fp8 \
--dtype bfloat16 \
--tensor-parallel-size 4 \
--trust-remote-code \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--reasoning-parser nemotron_v3

DeepHat

DeepHat V2 is Kindo’s cybersecurity-focused model built for offensive reasoning, long-context analysis, and secure execution. Treat it like any other self-hosted vLLM model in the contract — the specifics below just call out its hardware and flag requirements.

Hardware requirements

DeepHat V2 serves its full 250k-token context on:

  • 1× B200 GPU, or
  • 2× H100 GPUs.

A single H100 can serve at most ~90,000 tokens due to KV-cache memory.

Dependencies

  • vllm>=0.14.1 (installing it pulls in all other required dependencies).
  • A HuggingFace access token provisioned by Kindo: export HF_TOKEN=<token>.

Option A: Standalone vLLM

Terminal window
vllm serve DeepHat/DeepHat-V2-ext \
--served-model-name deephat-v2 \
--port 8000 \
--max-model-len 250000 \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice

Option B: Optional Helm chart

Skip this if you already have LLM serving infrastructure — run vLLM wherever you want and point Kindo at it. If you want a turnkey in-cluster deployment, Kindo ships a Helm chart.

values-deephat.yaml for 2× H100:

model: DeepHat/DeepHat-V2-ext
servedModelName: deephat-v2
maxModelLen: 250000
dtype: bfloat16
tensorParallelSize: '2'
enableChunkedPrefill: 'true'
enablePrefixCaching: 'true'
enableAutoToolChoice: 'true'
toolCallParser: 'qwen3_coder'
resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2
hfToken: '<your_huggingface_token>'
vllmApiKey: '<your_api_key>'

For 1× B200, set tensorParallelSize: "1" and GPU resources to 1.

Deploy:

Terminal window
helm install deephat-v2 . -f values-deephat.yaml

DeepHat in the contract

models:
  - litellmModelName: deephat-v2
    litellmModel: openai/deephat-v2
    displayName: DeepHat V2
    provider: Self-Hosted
    creator: Kindo
    type: CHAT
    metadataType: Text Generation
    costTier: LOW
    contextWindow: 250000
    maxTokens: 16384
    apiBase: http://deephat-v2.inference.svc.cluster.local:8000/v1
    apiKey: ${VLLM_API_KEY} # whatever you set as vllmApiKey above
    description: Kindo's cybersecurity-focused model with 250k context

Verify: Is the endpoint reachable from the cluster?

Always verify the endpoint from inside the cluster before running kindo install --apply. A local curl from your laptop is not sufficient: split DNS and firewall rules regularly trip people up here.

Step 1: DNS from a pod

Terminal window
kubectl run dns-test --rm -it --restart=Never \
--image=busybox:1.36 \
-- nslookup <your-inference-host>

Step 2: HTTP /v1/models from a pod

For a cluster-internal vLLM service:

Terminal window
kubectl run net-test --rm -it --restart=Never \
--image=curlimages/curl:8.10.1 \
-- curl -sS -m 5 http://<your-inference-host>:<port>/v1/models

For a cloud provider, use the provider’s host with your key. Example — Anthropic:

Terminal window
kubectl run net-test --rm -it --restart=Never \
--image=curlimages/curl:8.10.1 \
--env ANTHROPIC_API_KEY=sk-ant-... \
-- sh -c 'curl -sS -m 10 https://api.anthropic.com/v1/models \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01"'

Expected: a JSON response listing the model(s). If this fails, fix connectivity before proceeding — kindo install --apply will fail the same way.

Step 3 (self-hosted only): tool-call smoke test

Terminal window
kubectl run tool-test --rm -it --restart=Never \
--image=curlimages/curl:8.10.1 \
-- curl -sS -m 30 \
-X POST http://<your-inference-host>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "<your-served-model-name>",
  "messages": [{"role": "user", "content": "What is the weather in SF?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }
  }],
  "tool_choice": "auto",
  "max_tokens": 256
}'

Expected: the response contains a tool_calls array with "name": "get_weather" and "location": "San Francisco" (or similar). A plain text response means your --tool-call-parser is missing or wrong — agents will not work.
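If you would rather check the response programmatically than eyeball it, the sketch below parses an illustrative OpenAI-compatible payload. The JSON here is a hand-written example of the expected shape, not output captured from a real server:

```python
import json

# Illustrative OpenAI-compatible chat-completion response containing a tool call.
# In practice you would feed the smoke test's actual output into json.loads.
response = json.loads("""
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\\"location\\": \\"San Francisco\\"}"
        }
      }]
    }
  }]
}
""")

# An empty or missing tool_calls array means the model answered in plain text,
# i.e. the --tool-call-parser is missing or wrong.
tool_calls = response["choices"][0]["message"].get("tool_calls") or []
assert tool_calls, "no tool_calls: check --tool-call-parser / --enable-auto-tool-choice"
call = tool_calls[0]["function"]
print(call["name"], json.loads(call["arguments"])["location"])  # → get_weather San Francisco
```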

Common pitfalls

| Pitfall | Impact | Prevention |
|---|---|---|
| contextWindow too small | Conversations truncated, agents lose context | Use the model's full advertised window |
| Missing --tool-call-parser | Agents cannot call tools | Always pair with --enable-auto-tool-choice |
| --served-model-name ≠ litellmModelName | LiteLLM returns 404 | Keep both strings identical |
| DNS not reachable from cluster | kindo install --apply fails at post-install | Verify with kubectl run ... nslookup first |
| Bedrock model not enabled in AWS | 403 from LiteLLM | Enable via Bedrock → Model access before install |
| Azure missing apiVersion | LiteLLM rejects requests | apiBase, apiVersion, and apiKey are all required |
| Skipped verification | Kindo blamed for inference bugs | Run all three verification steps before install |

Next

Once models: is filled in and each endpoint passes the kubectl run checks, you’re ready to install.