# Prepare AI Models
Self-Managed Kindo does not ship AI models — you provide the inference infrastructure. That means picking cloud API providers (OpenAI, Anthropic, Gemini, Bedrock, Azure), self-hosting open-weight models with vLLM, or mixing the two. This page shows you the moving parts and — most importantly for the CLI flow — how those models plug into install-contract.yaml under models: so kindo install --apply can register them with the platform.
## What the CLI does for you
During the post-install step, kindo-cli reads the models: array from your install-contract.yaml and registers each entry with the Kindo admin API. After this step:
- Models exist in the `Model` database table with IDs assigned.
- Unleash feature flag variants (`EMBEDDING_MODELS`, `AUDIO_TRANSCRIPTION`, `TOOL_CALLING_MODELS`, `DEFAULT_WORKFLOW_STEP_MODEL`, etc.) are updated to point at those model IDs.
- LiteLLM is configured to route traffic to the correct endpoint for each model name.
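For orientation, each contract entry ultimately becomes a LiteLLM route. A minimal sketch of what a generated route could look like, using LiteLLM's public `model_list` proxy-config shape (illustrative only; the CLI's actual output may differ):

```yaml
# Illustrative sketch — the CLI generates this for you.
# Field names follow LiteLLM's proxy config format, not a Kindo-specific schema.
model_list:
  - model_name: claude-sonnet-4-20250514          # your litellmModelName
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514   # your litellmModel
      api_key: os.environ/ANTHROPIC_API_KEY
```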
Your job on this page is to:
- Decide which providers/endpoints you will use.
- Make sure each endpoint is reachable from the cluster.
- Describe them in `models:` using the contract shape below.
## The `models:` contract shape
models: is a list of dicts. Every entry uses the same top-level keys regardless of provider — only the credential fields change. The CLI writes this during kindo config init; you can also edit by hand and re-validate.
```yaml
models:
  - litellmModelName: claude-sonnet-4-20250514        # unique routing key, matches Unleash variants
    litellmModel: anthropic/claude-sonnet-4-20250514  # LiteLLM model ID with provider prefix
    displayName: Claude Sonnet 4                      # shown in the UI model picker
    provider: Anthropic                               # ModelProvider display name (must match exactly)
    creator: Anthropic
    type: CHAT                                        # CHAT | INTERNAL
    metadataType: Text Generation                     # Text Generation | Embeddings | Transcription
    costTier: MEDIUM                                  # LOW | MEDIUM | HIGH
    contextWindow: 200000                             # the model's full advertised context window
    maxTokens: 16384                                  # max output tokens
    apiKey: ${ANTHROPIC_API_KEY}                      # or inline secret, or env var
    description: Strong general-purpose model for chat and agents
    docLink: https://docs.anthropic.com/
```

### Field reference
| Field | Required | What it is |
|---|---|---|
| `litellmModelName` | yes | Unique routing key. Must match `--served-model-name` for vLLM and the names referenced by Unleash variants. |
| `litellmModel` | yes | Provider-prefixed LiteLLM model ID (see prefix table below). |
| `displayName` | yes | Human-readable name shown in the UI. |
| `provider` | yes | Maps to `modelProviderDisplayName` — must match an existing `ModelProvider` row or a new one is created. |
| `creator` | recommended | Model author (Anthropic, OpenAI, NVIDIA, Meta, etc.). |
| `type` | yes | `CHAT` for user-facing LLMs, `INTERNAL` for embeddings/transcription. |
| `metadataType` | yes | `Text Generation`, `Embeddings`, or `Transcription`. Drives Unleash flag routing. |
| `costTier` | recommended | `LOW`, `MEDIUM`, or `HIGH`. |
| `contextWindow` | yes | Set to the model's full advertised context window — do not low-ball it. |
| `maxTokens` | recommended | Maximum output tokens LiteLLM will request. |
| `description` / `docLink` | optional | Surfaced in the model picker. |
| Credentials | varies | See per-provider sections below (`apiKey`, `apiBase`, `apiVersion`, `awsRegion`, `awsAccessKeyId`, `awsSecretAccessKey`, `baseModel`). |
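The required/optional split above is easy to check mechanically before you run the installer. A minimal sketch in Python (a hypothetical helper, not part of kindo-cli) that flags missing required fields in a `models:` entry:

```python
# Required keys per the field reference table above (illustrative check only).
REQUIRED = {"litellmModelName", "litellmModel", "displayName",
            "provider", "type", "metadataType", "contextWindow"}

def missing_fields(entry: dict) -> set:
    """Return the required contract fields absent from one models: entry."""
    return REQUIRED - entry.keys()

entry = {
    "litellmModelName": "gpt-4o",
    "litellmModel": "openai/gpt-4o",
    "displayName": "GPT-4o",
    "provider": "OpenAI",
    "type": "CHAT",
    "metadataType": "Text Generation",
}
print(missing_fields(entry))  # → {'contextWindow'}
```

Running something like this over every entry before `kindo install --apply` turns a post-install registration failure into a pre-install typo fix.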
## Provider prefix reference
| Provider | litellmModel prefix | Example |
|---|---|---|
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Google Gemini | gemini/ | gemini/gemini-2.5-pro |
| Azure OpenAI | azure/ | azure/<your-deployment-name> |
| AWS Bedrock | bedrock/ | bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0 |
| Self-hosted (vLLM) | openai/ | openai/deephat-v2 |
## Cloud provider playbooks
Cloud models are the simplest path — an API key, the right prefix, and network egress. No GPUs, no inference tuning.
### OpenAI

```yaml
models:
  - litellmModelName: gpt-4o
    litellmModel: openai/gpt-4o
    displayName: GPT-4o
    provider: OpenAI
    creator: OpenAI
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiKey: sk-proj-...
```

- Use the exact model ID OpenAI publishes (e.g. `gpt-4o`, `gpt-4o-mini`). Outdated IDs silently fall back or fail.
- Omit `apiBase` — LiteLLM routes to `https://api.openai.com/v1` by default.
- An organization ID can be supplied by setting `OPENAI_ORG_ID` in the LiteLLM environment if needed.
### Anthropic

```yaml
models:
  - litellmModelName: claude-sonnet-4-20250514
    litellmModel: anthropic/claude-sonnet-4-20250514
    displayName: Claude Sonnet 4
    provider: Anthropic
    creator: Anthropic
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 200000
    maxTokens: 16384
    apiKey: sk-ant-...
```

- Use the dated model ID (e.g. `claude-sonnet-4-20250514`), not an alias, for reproducible deployments.
### Google Gemini

```yaml
models:
  - litellmModelName: gemini-2.5-pro
    litellmModel: gemini/gemini-2.5-pro
    displayName: Gemini 2.5 Pro
    provider: Google
    creator: Google
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 1000000
    maxTokens: 65536
    apiKey: AIza...
```

### AWS Bedrock

```yaml
models:
  - litellmModelName: claude-sonnet-4-bedrock
    litellmModel: bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0
    displayName: Claude Sonnet 4 (Bedrock)
    provider: AWS Bedrock
    creator: Anthropic
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 200000
    maxTokens: 16384
    awsRegion: us-west-2
    awsAccessKeyId: AKIA...
    awsSecretAccessKey: ...
```

- Enable the model in the AWS Console under Bedrock → Model access before use.
- Use cross-region inference profiles (e.g. `us.anthropic.claude-*`) instead of region-specific ARNs for better availability.
- IAM requires `bedrock:InvokeModel` plus `aws-marketplace:ViewSubscriptions` and `aws-marketplace:Subscribe`.
- If you leave `awsAccessKeyId`/`awsSecretAccessKey` empty, LiteLLM falls back to the IRSA role on the cluster — only do this if your service account has Bedrock permissions.
### Azure OpenAI

```yaml
models:
  - litellmModelName: azure-gpt-4o
    litellmModel: azure/my-gpt4o-deployment
    displayName: GPT-4o (Azure)
    provider: Azure OpenAI
    creator: OpenAI
    type: CHAT
    metadataType: Text Generation
    costTier: MEDIUM
    contextWindow: 128000
    maxTokens: 16384
    apiBase: https://my-resource.openai.azure.com
    apiVersion: '2024-10-21'
    apiKey: ...
    baseModel: gpt-4o   # optional: LiteLLM model_info.base_model for pricing
```

- `litellmModel` uses your Azure deployment name, not the OpenAI model name.
- `apiBase`, `apiVersion`, and `apiKey` are all required — Azure will not route without them.
- Set `baseModel` to the underlying OpenAI model (`gpt-4o`, `gpt-4o-mini`, etc.) so LiteLLM can report correct pricing and capabilities.
## Self-hosted vLLM patterns
Self-hosting gets you full data-residency control and unlimited request volume, but you own the inference server. vLLM is what we recommend, and what this section covers.
### Pre-deployment checklist
- GPUs sized for your target context length (table below).
- NVIDIA drivers + `nvidia-container-runtime` installed — `nvidia-smi` shows your GPUs.
- Model weights pulled (HuggingFace token exported: `export HF_TOKEN=...`).
- Official model card read — context window, max output tokens, dtype, and any special flags.
- Correct `--tool-call-parser` identified for your model family.
- DNS/hostname planned — the endpoint will be reachable from the cluster.
### GPU sizing
Context length is the primary VRAM driver via the KV cache. A 70B model at 4k context needs far less VRAM than the same model at 128k.
| Model size | Quantization | Short (4k) | Medium (32k) | Full (128k+) |
|---|---|---|---|---|
| 7–8B | FP16/BF16 | 1× 24GB | 1× 24GB | 1× 48GB |
| 7–8B | FP8 | 1× 24GB | 1× 24GB | 1× 24GB |
| 13B | FP16/BF16 | 1× 48GB | 1× 80GB | 1× 80GB |
| 13B | FP8 | 1× 24GB | 1× 48GB | 1× 80GB |
| 30–34B | FP16/BF16 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 30–34B | FP8 | 1× 48GB | 1× 80GB | 1–2× 80GB |
| 70B | FP16/BF16 | 2× 80GB | 4× 80GB | 4–8× 80GB |
| 70B | FP8 | 1× 80GB | 2× 80GB | 2–4× 80GB |
| 70B | AWQ/GPTQ | 1× 80GB | 2× 80GB | 4× 80GB |
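The KV-cache arithmetic behind this table can be sketched directly. The numbers below assume a Llama-3-70B-style geometry (80 layers, 8 KV heads via grouped-query attention, head dimension 128) — illustrative values, so check your model's config file before sizing hardware:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_el: int = 2) -> int:
    # Keys and values are each tokens * kv_heads * head_dim elements per layer,
    # hence the leading factor of 2. bytes_per_el=2 corresponds to FP16/BF16.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_el

GIB = 2**30
print(kv_cache_bytes(80, 8, 128, 4_096) / GIB)    # 1.25  (GiB of KV cache at 4k context)
print(kv_cache_bytes(80, 8, 128, 131_072) / GIB)  # 40.0  (GiB of KV cache at 128k context)
```

The weights themselves (~140 GB for a 70B model in BF16) sit on top of this, which is why the full-context column jumps to 4–8× 80GB cards.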
### Essential vLLM flags
```shell
# --served-model-name: must match litellmModelName in the contract
# --max-model-len: the model's full context window
# --tool-call-parser + --enable-auto-tool-choice: required for agents
vllm serve <model-id> \
  --served-model-name <name> \
  --port 8000 \
  --max-model-len <context-length> \
  --dtype bfloat16 \
  --tensor-parallel-size <num-gpus> \
  --tool-call-parser <parser> \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

Two flags cause the majority of issues:

- `--max-model-len` — set to the model's full context. If vLLM OOMs on startup, it will report the maximum your hardware can support; use that number, don't guess a lower one.
- `--tool-call-parser` — without it, Kindo agents cannot call tools. Always pair with `--enable-auto-tool-choice`.
### Tool-call parser reference
| Model family | --tool-call-parser | Notes |
|---|---|---|
| NVIDIA Nemotron 3 Super | qwen3_coder | Also set --reasoning-parser nemotron_v3. Requires --trust-remote-code. |
| Mistral / Mixtral | mistral | |
| DeepSeek-V3 / R1 | hermes | Verify against latest vLLM release. |
| Qwen 3 / Qwen 3 Coder | qwen3_coder | |
| DeepHat V2 | qwen3_coder | See DeepHat section below. |
| GPT OSS | openai | Also set --reasoning-parser openai_gptoss. |
### `--served-model-name` and the contract
--served-model-name is the string vLLM uses for the model field in its OpenAI-compatible API. Kindo routes requests via LiteLLM, which resolves a litellmModelName to the litellmModel (e.g. openai/my-name), then strips the openai/ prefix and sends "model": "my-name" to your inference server.
The rule: --served-model-name in vLLM must match the suffix of litellmModel and the litellmModelName in your contract.
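The resolution chain can be illustrated in a few lines of Python — a sketch of the rule, not LiteLLM's actual implementation:

```python
def upstream_model_field(litellm_model: str) -> str:
    """Sketch: what ends up in the "model" field sent to the inference server.

    For openai/-prefixed entries (custom OpenAI-compatible endpoints like vLLM),
    the prefix is stripped; the remainder must equal --served-model-name.
    """
    prefix, sep, rest = litellm_model.partition("/")
    return rest if sep and prefix == "openai" else litellm_model

assert upstream_model_field("openai/deephat-v2") == "deephat-v2"
```

If vLLM was started with `--served-model-name deephat-v2`, the stripped name matches and the request routes; any mismatch surfaces as a 404 from the inference server.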
```shell
vllm serve Qwen/Qwen3-Coder-30B-Instruct \
  --served-model-name qwen3-coder-30b \
  ...
```

```yaml
models:
  - litellmModelName: qwen3-coder-30b    # same as --served-model-name
    litellmModel: openai/qwen3-coder-30b # same suffix, openai/ prefix
    apiBase: http://vllm-qwen.inference.svc.cluster.local:8000/v1
    ...
```

### Full example: NVIDIA Nemotron 3 Super on 4× H100
```shell
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --served-model-name nemotron \
  --port 8000 \
  --max-model-len 1000000 \
  --kv-cache-dtype fp8 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --reasoning-parser nemotron_v3
```

## DeepHat
DeepHat V2 is Kindo’s cybersecurity-focused model built for offensive reasoning, long-context analysis, and secure execution. Treat it like any other self-hosted vLLM model in the contract — the specifics below just call out its hardware and flag requirements.
### Hardware requirements
DeepHat V2 serves its full 250k-token context on:
- 1× B200 GPU, or
- 2× H100 GPUs.
A single H100 can serve at most ~90,000 tokens due to KV-cache memory.
### Dependencies
- `vllm>=0.14.1` (brings all implicit dependencies).
- A HuggingFace access token provisioned by Kindo: `export HF_TOKEN=<token>`.
### Option A: Standalone vLLM
2× H100:

```shell
vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port 8000 \
  --max-model-len 250000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice
```

1× B200:

```shell
vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port 8000 \
  --max-model-len 250000 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice
```

### Option B: Optional Helm chart
Skip this if you already have LLM serving infrastructure — run vLLM wherever you want and point Kindo at it. If you want a turnkey in-cluster deployment, Kindo ships a Helm chart.
`values-deephat.yaml` for 2× H100:

```yaml
model: DeepHat/DeepHat-V2-ext
servedModelName: deephat-v2
maxModelLen: 250000
dtype: bfloat16
tensorParallelSize: '2'
enableChunkedPrefill: 'true'
enablePrefixCaching: 'true'
enableAutoToolChoice: 'true'
toolCallParser: 'qwen3_coder'

resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2

hfToken: '<your_huggingface_token>'
vllmApiKey: '<your_api_key>'
```

For 1× B200, set `tensorParallelSize: '1'` and GPU resources to 1.
Deploy:

```shell
helm install deephat-v2 . -f values-deephat.yaml
```

### DeepHat in the contract
```yaml
models:
  - litellmModelName: deephat-v2
    litellmModel: openai/deephat-v2
    displayName: DeepHat V2
    provider: Self-Hosted
    creator: Kindo
    type: CHAT
    metadataType: Text Generation
    costTier: LOW
    contextWindow: 250000
    maxTokens: 16384
    apiBase: http://deephat-v2.inference.svc.cluster.local:8000/v1
    apiKey: ${VLLM_API_KEY}   # whatever you set as vllmApiKey above
    description: Kindo's cybersecurity-focused model with 250k context
```

## Verify: Is the endpoint reachable from the cluster?
Always verify the endpoint from inside the cluster before running kindo install --apply. A local curl from your laptop is not sufficient — split DNS and firewall rules routinely make endpoints that work locally unreachable from pods.
### Step 1: DNS from a pod
```shell
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 \
  -- nslookup <your-inference-host>
```

### Step 2: HTTP `/v1/models` from a pod
For a cluster-internal vLLM service:
```shell
kubectl run net-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  -- curl -sS -m 5 http://<your-inference-host>:<port>/v1/models
```

For a cloud provider, use the provider's host with your key. Example — Anthropic:
```shell
kubectl run net-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  --env ANTHROPIC_API_KEY=sk-ant-... \
  -- sh -c 'curl -sS -m 10 https://api.anthropic.com/v1/models \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01"'
```

Expected: a JSON response listing the model(s). If this fails, fix connectivity before proceeding — kindo install --apply will fail the same way.
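For a self-hosted vLLM endpoint, a healthy reply looks roughly like this (abbreviated sketch; vLLM follows the OpenAI-compatible list format, and the `id` should equal your `--served-model-name`):

```json
{
  "object": "list",
  "data": [
    { "id": "deephat-v2", "object": "model" }
  ]
}
```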
### Step 3 (self-hosted only): tool-call smoke test
```shell
kubectl run tool-test --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 \
  -- curl -sS -m 30 \
    -X POST http://<your-inference-host>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "<your-served-model-name>",
      "messages": [{"role": "user", "content": "What is the weather in SF?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
          }
        }
      }],
      "tool_choice": "auto",
      "max_tokens": 256
    }'
```

Expected: the response contains a `tool_calls` array with `"name": "get_weather"` and `"location": "San Francisco"` (or similar). A plain-text response means your `--tool-call-parser` is missing or wrong — agents will not work.
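A successful response carries the call in `tool_calls` rather than in the message text. An abbreviated sketch of the OpenAI-compatible shape to look for (fields trimmed for readability):

```json
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      }
    }
  ]
}
```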
## Common pitfalls
| Pitfall | Impact | Prevention |
|---|---|---|
| `contextWindow` too small | Conversations truncated, agents lose context | Use the model's full advertised window |
| Missing `--tool-call-parser` | Agents cannot call tools | Always pair with `--enable-auto-tool-choice` |
| `--served-model-name` ≠ `litellmModelName` | LiteLLM returns 404 | Keep both strings identical |
| DNS not reachable from cluster | `kindo install --apply` fails at post-install | Verify with `kubectl run ... nslookup` first |
| Bedrock model not enabled in AWS | 403 from LiteLLM | Enable via Bedrock → Model access before install |
| Azure missing `apiVersion` | LiteLLM rejects requests | `apiBase`, `apiVersion`, and `apiKey` are all required |
| Skipped verification | Kindo blamed for inference bugs | Run all three verification steps before install |
## Next
Once models: is filled in and each endpoint passes the kubectl run checks, you’re ready to install.