Deploying DeepHat

Created: 01/22/2026

Last Edited: 02/02/2026

This doc should be treated as a test draft. It should be updated with new information once it has been put to use with an actual customer.

DeepHat, formerly WhiteRabbitNeo, is Kindo's uncensored cybersecurity model, built for real offensive reasoning, long-context analysis, and secure execution.

Hardware requirements

DeepHat V2 is verified to run and serve its full context length (250k tokens) on 2x H100 GPUs or 1x B200 GPU.

Dependencies

Verify the following packages are installed on the hardware:

  • vllm>=0.12.0

Environment Setup

A fine-grained access token provided by Kindo will be necessary to pull the model weights from HuggingFace. This token should be securely stored and injected into the environment as an environment variable:

export HF_TOKEN=<your_access_token>
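
As an optional sanity check, you can confirm the token can read the model repository before starting the server. The sketch below queries the public Hugging Face Hub API for the repository used in the serve commands; a JSON response indicates access, while a 401/403 means the token is missing or not authorized.

# Optional: verify the token can access the gated DeepHat repository
curl -sf -H "Authorization: Bearer $HF_TOKEN" \
"https://huggingface.co/api/models/DeepHat/DeepHat-V2-ext"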

Deployment

Run the vllm serve command below that matches your hardware; the only difference between the two is the tensor_parallel_size value.

2xH100:

vllm serve DeepHat/DeepHat-V2-ext \
--served-model-name deephat-v2 \
--port <port_you_choose> \
--max_model_len 250000 \
--dtype bfloat16 \
--tensor_parallel_size 2 \
--tool_call_parser qwen3_coder \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice

1xB200:

vllm serve DeepHat/DeepHat-V2-ext \
--served-model-name deephat-v2 \
--port <port_you_choose> \
--max_model_len 250000 \
--dtype bfloat16 \
--tensor_parallel_size 1 \
--tool_call_parser qwen3_coder \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice
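
Loading the weights can take several minutes. One way to wait for the server to become ready before sending traffic is to poll vLLM's /health endpoint (a minimal sketch, assuming you are polling from the serving host on the port chosen above):

# Wait until the vLLM server reports healthy before sending requests
until curl -sf "http://localhost:<port_you_choose>/health" > /dev/null; do
  echo "waiting for vLLM to finish loading..."
  sleep 10
done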

Once the server is up and running, try a sample request:

curl -X POST "<server_base_url>:<port_you_choose>/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "deephat-v2", "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who are you?"}], "max_tokens": 1024, "temperature": 0.7}'
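
If jq is available, the assistant's reply can be pulled straight out of the chat-completions response (field names follow the OpenAI-compatible schema that vLLM serves):

curl -s -X POST "<server_base_url>:<port_you_choose>/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "deephat-v2", "stream": false, "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 256, "temperature": 0.7}' \
| jq -r '.choices[0].message.content'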

Hosting DeepHat V2 via the Kindo Helm Chart

Note: This Helm chart is optional and provided by Kindo as a convenience for customers who do not have their own LLM serving infrastructure. If you already have a preferred method for running LLM inference (e.g., existing vLLM deployment, custom orchestration, or other serving solutions), you can use that instead. The instructions below describe how to run DeepHat V2 either standalone or using this Helm chart.

Overview

This section covers the setup and commands to run a vLLM server serving DeepHat V2.

Important: This document does NOT cover:

  • Recovery/availability of the server should it crash
  • Exposing the inference endpoint to external applications outside of the server's network
  • Securing the inference endpoint

Hardware Requirements

DeepHat V2 is verified to run and serve its full context length (250k tokens) on:

  • 1x B200 GPU
  • 2x H100 GPUs

1x H100 can serve a max context length of at most 90,000 tokens due to KV-cache memory requirements.

The model is natively in bfloat16:

  1. The model weights of DHv2 require 63 GB of memory.
  2. The KV-cache of DHv2 requires 96 KiB per token.
  3. vLLM allocates 90% of available GPU memory by default.

The minimum memory needed to serve a single request of 250,000 tokens is therefore:

63 GB (weights) + 250,000 tokens × 96 KiB/token ≈ 63 GB + 24.6 GB ≈ 87.6 GB, or roughly 97 GB of total GPU memory once the 90% allocation limit is taken into account.

Use the following formula to determine the number of tokens that can fit into KV-cache:

max_tokens ≈ (0.9 × total GPU memory − 63 GB) / 96 KiB per token

For example, a single 80 GB H100 yields (0.9 × 80 GB − 63 GB) / 96 KiB ≈ 9 GB / 96 KiB ≈ 91,000 tokens, which is where the ~90,000-token limit above comes from.

While vLLM enforces a strict memory check at launch to prevent potential overflows, that check is a worst-case baseline. During actual inference, KV-cache memory is allocated dynamically rather than reserved per request, so memory is consumed only by the tokens actually in use (both input and output), maximizing the GPU memory available for parallel user requests.
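
The 90% figure is vLLM's --gpu-memory-utilization default (0.9). If anything else shares the GPUs, it can be lowered; the snippet below is an optional illustration, not part of the verified configurations:

# Optional: cap vLLM at 85% of GPU memory instead of the default 90%
# (shown for the 2x H100 layout; keep the remaining flags from the
# deployment commands below)
vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --tensor_parallel_size 2 \
  --max_model_len 250000 \
  --gpu-memory-utilization 0.85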

Dependencies

Verify the following packages are installed on the hardware:

  • vllm>=0.12.0

Note: vllm is the only explicitly required package. Installing vllm will bring along all of its implicit dependencies.
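
For example, in a fresh Python environment (this assumes a pip-based install; any supported vLLM installation method works equally well):

# Install vLLM and its transitive dependencies
pip install "vllm>=0.12.0"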

Environment Setup

A fine-grained access token provided by Kindo is necessary to pull the model weights from HuggingFace. This token should be securely stored and injected into the environment as an environment variable:

export HF_TOKEN=<your_access_token>
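
Optionally, the model weights can be pre-downloaded into the local Hugging Face cache so the first server start does not block on the download. This sketch assumes the huggingface_hub CLI (installed alongside vLLM) and the HF_TOKEN exported above:

# Optional: pre-fetch the DeepHat V2 weights before starting the server
huggingface-cli download DeepHat/DeepHat-V2-ext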

Deployment Options

Option 1: Standalone vLLM Deployment (Without Helm)

If you prefer to run vLLM directly without Kubernetes or this Helm chart, use the following commands depending on your hardware. The difference is the tensor_parallel_size value.

2x H100:

vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port <port_you_choose> \
  --max_model_len 250000 \
  --dtype bfloat16 \
  --tensor_parallel_size 2 \
  --tool_call_parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice

1x B200:

vllm serve DeepHat/DeepHat-V2-ext \
  --served-model-name deephat-v2 \
  --port <port_you_choose> \
  --max_model_len 250000 \
  --dtype bfloat16 \
  --tensor_parallel_size 1 \
  --tool_call_parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice

Option 2: Using the Kindo Helm Chart

If you want to deploy DeepHat V2 using Kubernetes and this Helm chart, create a values-deephat.yaml file:

For 2x H100:

model: DeepHat/DeepHat-V2-ext
servedModelName: deephat-v2
maxModelLen: 250000
dtype: bfloat16
tensorParallelSize: "2"
enableChunkedPrefill: "true"
enablePrefixCaching: "true"
enableAutoToolChoice: "true"
toolCallParser: "qwen3_coder"

resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2

hfToken: "<your_huggingface_token>"
vllmApiKey: "<your_api_key>"

For 1x B200:

model: DeepHat/DeepHat-V2-ext
servedModelName: deephat-v2
maxModelLen: 250000
dtype: bfloat16
tensorParallelSize: "1"
enableChunkedPrefill: "true"
enablePrefixCaching: "true"
enableAutoToolChoice: "true"
toolCallParser: "qwen3_coder"

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

hfToken: "<your_huggingface_token>"
vllmApiKey: "<your_api_key>"

Then deploy with:

helm install deephat-v2 . -f values-deephat.yaml
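
If you prefer not to keep tokens in values-deephat.yaml, they can be supplied at install time instead, and standard kubectl commands can be used to watch the rollout. The label selector below is an assumption about how the chart labels its resources; adjust it to match the chart.

# Alternative install that keeps secrets out of values-deephat.yaml
helm install deephat-v2 . -f values-deephat.yaml \
  --set hfToken=$HF_TOKEN \
  --set vllmApiKey=<your_api_key>

# Watch the pod come up and follow the vLLM startup logs
kubectl get pods -l app.kubernetes.io/instance=deephat-v2
kubectl logs -f -l app.kubernetes.io/instance=deephat-v2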

Verifying the Deployment

Once the server is up and running, try a sample request:

curl -X POST "<server_base_url>:<port_you_choose>/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deephat-v2",
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

Important: The value provided for --served-model-name (or servedModelName in values.yaml) must match the name used when adding DeepHat as a self-managed model in Kindo.
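
A quick way to confirm the served model name (and that it matches what is configured in Kindo) is to list the models the server exposes. If the server was started with an API key (vllmApiKey in the Helm values, or vLLM's --api-key flag for the standalone command), pass it as a bearer token; otherwise the header can be omitted.

# List the models served by the endpoint; the returned id should read deephat-v2
curl -s "<server_base_url>:<port_you_choose>/v1/models" \
  -H "Authorization: Bearer <your_api_key>"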