Observability Guide

Kindo uses OpenTelemetry (OTEL) for unified observability across all services, covering traces, metrics, and logs.

Architecture

+-----------------------------------------------------------+
|                     Kindo Services                        |
+--------+--------+--------+--------+--------+--------------+
|  API   |  Task  | Ext.   |Credits | LiteLLM|  Next.js     |
|        | Worker | Sync   |        |(Python)|              |
+---+----+---+----+---+----+---+----+---+----+---+----------+
    |        |        |        |        |        |
    +--------+--------+--------+--------+--------+
                         |  OTLP
                  +------v------+      +------------+
                  |   Gateway   |<-----+   Agent    |
                  | (OTel coll) |  OTLP|(Prom scrape|
                  +------+------+      |  → gateway)|
                         |             +------------+
                         |
                         v
                +--------+--------+
                | Grafana / Tempo |
                |    backend      |
                +-----------------+

Gateway and Agent are two workloads of the same backend/otel-collector chart, deployed as a single Helm release in the kindo-monitoring namespace. Apps emit OTLP to the gateway; the agent scrapes Prometheus targets across the cluster and forwards them via OTLP to the same gateway.

Key Principles

Auto-Instrumentation — Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation.
Custom Instrumentation — Business-critical operations are manually instrumented with custom spans and metrics.
Unified Collection — All telemetry flows through the OTEL Collector for processing and export.

Environment Variables

Core OTEL Configuration

Variable	Description	Required
`OTEL_SERVICE_NAME`	Unique identifier for the service	Yes
`OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP collector endpoint	Yes
`OTEL_EXPORTER_OTLP_PROTOCOL`	Protocol (`http/protobuf` or `grpc`)	No
`OTEL_RESOURCE_ATTRIBUTES`	Additional resource attributes	No
`OTEL_METRICS_EXPORTER`	Metrics exporter (`otlp`, `prometheus`, `none`)	No
`OTEL_TRACES_EXPORTER`	Traces exporter (`otlp`, `none`)	No
`OTEL_LOGS_EXPORTER`	Logs exporter (`otlp`, `none`)	No
`OTEL_METRIC_EXPORT_INTERVAL`	Metrics export interval (ms)	No
`OTEL_BSP_MAX_QUEUE_SIZE`	Max spans queued before dropping	No
`OTEL_BSP_SCHEDULE_DELAY`	Batch export delay (ms)	No
`OTEL_SDK_DISABLED`	Disable OTEL SDK entirely	No

Trace Sampling

Variable	Description	Example
`OTEL_TRACES_SAMPLER`	Sampler type	`always_on`, `traceidratio`
`OTEL_TRACES_SAMPLER_ARG`	Sampler argument	`0.1` (10% sampling)

Service-Specific Configuration

API Service

OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Auto-instruments HTTP, Express, Prisma, Redis, and RabbitMQ. Includes custom Hatchet instrumentation and Winston log correlation.

Task Worker

OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Custom Hatchet SDK instrumentation for distributed workflow tracing. Propagates W3C traceparent through workflow metadata.
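
To make the propagation step concrete, here is a minimal sketch of how a W3C traceparent value can be rendered, carried in a metadata dictionary, and parsed back on the receiving side. It uses only the Python standard library rather than the Hatchet or OpenTelemetry SDKs, and the "traceparent" metadata key and the helper names are assumptions for illustration, not the names the task worker actually uses.

```python
import re

# W3C Trace Context layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version-00 traceparent value."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None if the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_hex, span_hex, flags_hex = match.groups()
    if version == "ff":  # "ff" is reserved as an invalid version
        return None
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid
        return None
    return trace_id, span_id, bool(int(flags_hex, 16) & 0x01)


def inject(metadata: dict, trace_id: int, span_id: int) -> dict:
    """Copy the metadata and add the context under a hypothetical key."""
    return {**metadata, "traceparent": format_traceparent(trace_id, span_id)}
```

A child workflow that reads the same key and passes it to parse_traceparent can then start its spans under the parent's trace ID, which is what keeps the trace connected across workflow boundaries.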

LiteLLM

OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Python-based service with OpenTelemetry Python instrumentation. Includes a custom metrics guardrail and dynamic metrics controlled via Unleash feature flags.

Other Services

All other services (external-sync, credits, external-poller, Next.js) follow the same pattern with their respective OTEL_SERVICE_NAME.

Custom Metrics

Application Metrics

Metric	Type	Description	Labels
`kindo_merge_download_total`	Counter	Merge download requests	—
`kindo_workflow_duplicate_transaction_latency_milliseconds`	Histogram	Duplicate workflow transaction latency	—
`kindo_external_sync_message_processed_total`	Counter	External sync messages processed	`topic`, `event`, `result`
`kindo_api_credits_service_called_total`	Counter	API calls to credits service	—
`kindo_chat_tool_state_conversion_total`	Counter	Tool parts state conversions	—

Grafana Dashboard Metrics

Metric	Description
`kindo_chat_message_total`	Total chat messages processed
`kindo_token_usage_total`	Token consumption
`kindo_ingestion_bytes_total`	Bytes ingested
`kindo_ingestion_duration_seconds_bucket`	Ingestion timing (histogram)
`kindo_unprocessed_file_count_null_plaintext_key`	Files awaiting processing

Auto-Instrumentation Coverage

Library/Framework	What’s Traced
HTTP/HTTPS	Incoming and outgoing requests
Express	Route handlers and middleware
Prisma	Database queries
Redis	Cache operations
RabbitMQ	Message queue operations
gRPC	Service-to-service calls
Winston	Log correlation with trace IDs

Hatchet Workflow Spans

Hatchet workflows are instrumented via HatchetInstrumentor from @hatchet-dev/typescript-sdk/opentelemetry, which ships with the SDK (v1.18.0+). It traces workflow creation, execution, task completion, event publishing, cron scheduling, and admin operations. The previous homegrown @kindo/instrumentation-hatchet package was removed once the upstream instrumentation shipped feature parity.

Span Attribute	Description
`hatchet.workflow_name`	Name of the workflow
`hatchet.workflow_run_id`	Unique run identifier
`hatchet.step_name`	Current step name
`hatchet.task_type`	`durable` or `non_durable`
`hatchet.task_duration`	Execution time (ms)

Durable Chat Spans

The task-worker-ts service adds durable chat domain spans inside the Hatchet workflow shell. These spans use durable_chat.* names for setup, LLM calls, tool execution, step commits, success finalization, and failure finalization.

Span Name	Description
`durable_chat.setup`	Durable chat setup and persisted conversation boundary loading
`durable_chat.llm_call`	Model request and response streaming work
`durable_chat.tool_execution`	Tool execution child work
`durable_chat.commit`	Step commit publication after a successful child result
`durable_chat.finalize_success`	Successful assistant message finalization
`durable_chat.finalize_failure`	Terminal failure persistence

Snake_case attributes are canonical for durable chat trace queries. Important attributes include conversation_id, message_id, durable_chat_task_name, durable_chat_step_kind, durable_chat_logical_step_id, durable_chat_attempt_id, workflow_run_id, parent_workflow_run_id, task_run_id, agent_run_id, org_id, user_id, and user_email when available. The failure-finalization span sets durable_chat_step_kind=finalize_failure and durable_chat_failure=true.

Logging

Log Format

All backend services use structured JSON logging with trace correlation:

{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "user_id": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}

Use snake_case for custom structured log fields and custom telemetry label keys added by application code. For example, prefer user_id, workflow_id, and operation_name over userId, workflowId, and operationName.
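
For a Python service, the log shape above could be produced with a small formatter like the sketch below. It is a standard-library illustration only: the class name is hypothetical, it is not the @kindo/observability logger, and it assumes the trace and span IDs are passed in through logging's extra= argument rather than read from an active span.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Format each record as one JSON object with trace correlation fields."""

    # snake_case keys copied from the record when present (illustrative set)
    CORRELATION_FIELDS = ("trace_id", "span_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace(
                "+00:00", "Z"
            ),
        }
        # Fields supplied via `extra=` become attributes on the record.
        for key in self.CORRELATION_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)
```

Attaching it with handler.setFormatter(JsonTraceFormatter()) and logging with extra={"trace_id": ..., "span_id": ...} yields one JSON line per record.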

Sensitive Headers

These headers are automatically redacted from logs: authorization, cookie, x-api-key, x-connection-config, x-kindo-metadata.
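
As an illustration of the redaction rule, the sketch below masks the listed headers before a header map is logged. The function name and the placeholder string are assumptions; the actual redaction lives in the services' logging layer and may differ in detail.

```python
# Header names from the list above, compared case-insensitively.
SENSITIVE_HEADERS = frozenset(
    {"authorization", "cookie", "x-api-key", "x-connection-config", "x-kindo-metadata"}
)


def redact_headers(headers: dict, placeholder: str = "[REDACTED]") -> dict:
    """Return a copy of headers with sensitive values replaced."""
    return {
        name: (placeholder if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

The original mapping is left untouched, so the request can still be processed with its real credentials after the log line is written.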

Best Practices

Service Naming

Use consistent, lowercase, hyphenated names: api, task-worker-ts, external-sync, credits, litellm, external-poller.

Resource Attributes

Include deployment context:

OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
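
The value is a comma-separated list of key=value pairs. The sketch below shows roughly how such a string breaks down into individual attributes; it is a simplified stand-in for what an SDK does internally, and the function name is hypothetical.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Split a comma-separated key=value string into a dict of attributes."""
    attributes = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, separator, value = item.partition("=")
        key = key.strip()
        if not separator or not key:
            continue  # skip malformed entries instead of failing
        # Values may be percent-encoded, so decode them.
        attributes[key] = unquote(value.strip())
    return attributes
```

A quick check like this in CI can catch a stray space or a missing "=" before the value reaches a deployment.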

Sampling

For high-traffic services:

OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
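
To show what a ratio of 0.1 means in practice, here is a minimal sketch of one common way a trace-ID ratio sampler decides: it compares the low 64 bits of the trace ID against a bound derived from the ratio. SDKs may differ in detail, and the function names here are illustrative.

```python
TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def sampling_bound(ratio: float) -> int:
    """Map a ratio in [0, 1] onto the 64-bit trace ID space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return round(ratio * (1 << 64))


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep the trace when its low 64 bits fall below the bound."""
    return (trace_id & TRACE_ID_LOW_64_MASK) < sampling_bound(ratio)
```

Because the decision depends only on the trace ID, two services configured with the same ratio reach the same verdict for a given trace, so sampled traces stay complete rather than losing random spans.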

Metric Label Cardinality

Avoid user IDs, request IDs, or session IDs as metric labels. Use span attributes for high-cardinality data. Stick to fixed, enumerable values. Use snake_case for custom metric label keys.
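
These rules are easy to check mechanically. The sketch below is a hypothetical lint helper, not part of any Kindo package, that flags label keys that are not snake_case or that name a high-cardinality identifier; the blocked key set is an assumption drawn from the examples above.

```python
import re

_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Keys whose values are unbounded; extend to fit your own services.
HIGH_CARDINALITY_KEYS = frozenset({"user_id", "request_id", "session_id"})


def check_metric_labels(labels: dict) -> list:
    """Return a list of human-readable problems; empty means the labels pass."""
    problems = []
    for key in labels:
        if not _SNAKE_CASE.match(key):
            problems.append(f"{key}: label key is not snake_case")
        elif key in HIGH_CARDINALITY_KEYS:
            problems.append(f"{key}: high-cardinality key, use a span attribute")
    return problems
```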

Self-managed OTel Collector (SMK)

Two orthogonal gates control observability:

peripheries.otel-collector (install contract) — whether the collector exists. When on, a single otel-collector release deploys to the kindo-monitoring namespace with two workloads: an otel-collector-gateway Deployment (receives OTLP from apps) and an otel-collector-agent Deployment (scrapes infrastructure Prometheus targets and forwards to the gateway). Every app’s OTel SDK flips on too — apps emit OTLP to otel-collector-gateway.kindo-monitoring:4317.
observability.grafana.* (environment bindings) — whether the collector ships out to Grafana. When the credentials are populated, the collector exports to Grafana’s Prometheus remote-write and OTLP endpoints. When empty, the collector still runs but exports only via debug (visible in the pod logs).

You can therefore enable the collector with no backend (useful for verifying app emission before wiring up storage), or skip the collector entirely (peripheries.otel-collector: false). The chart’s default config builder only knows about the Grafana exporters today — pointing the collector at a non-Grafana backend means supplying a hand-written gateway.config via the chart’s escape hatch, not flipping a value.

Wiring up Grafana

Add these under observability.grafana in environment-bindings.yaml:

observability:
  grafana:
    prometheusRemoteWriteEndpoint: 'https://prometheus-prod-XX.grafana.net/api/prom/push'
    prometheusRemoteWriteUsername: '<grafana cloud datasource id>'
    otlpEndpoint: 'https://otlp-gateway-prod-XX.grafana.net/otlp'
    otlpUsername: '<grafana cloud OTLP datasource id>'
    authPassword: '<grafana cloud API token>'
    clusterLabel: '<your cluster identifier>'

clusterLabel is shipped as the cluster Prometheus external label on every metric and is the value you’d use to group/filter multi-cluster dashboards. Use a stable per-cluster identifier (e.g. acme-prod, acme-staging).

The CLI projects this block into the cluster-resident secrets file as top-level grafana.* (so the helmfile values overlay can read them directly). If you populate the secrets file by hand via kindo config edit, use the top-level grafana: form to match.

All six fields are validated as an all-or-nothing set: partial population (e.g. the endpoints and usernames filled in but authPassword empty) is rejected at config-load time by GrafanaBindings.validate_grafana_completeness. Either fill them all in or leave them all empty.
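
The all-or-nothing rule can be pictured with the sketch below. It is an illustration of the behavior described here, not the actual GrafanaBindings.validate_grafana_completeness implementation; the function name, return convention, and error message are assumptions.

```python
GRAFANA_FIELDS = (
    "prometheusRemoteWriteEndpoint",
    "prometheusRemoteWriteUsername",
    "otlpEndpoint",
    "otlpUsername",
    "authPassword",
    "clusterLabel",
)


def check_grafana_completeness(grafana: dict) -> bool:
    """Return True if fully populated, False if fully empty; raise on a partial set."""
    populated = [
        name for name in GRAFANA_FIELDS if str(grafana.get(name) or "").strip()
    ]
    if not populated:
        return False  # collector runs, debug export only
    missing = [name for name in GRAFANA_FIELDS if name not in populated]
    if missing:
        raise ValueError(
            "observability.grafana is partially populated; missing: "
            + ", ".join(missing)
        )
    return True
```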

Chart shape

backend/otel-collector/helm_chart is a single chart that renders both workloads in one release:

Gateway (always) — Deployment + Service + ConfigMap (and an HPA when gateway.autoscaling.enabled: true). Receives OTLP from apps and exports to a backend. Built from grafana.*, cluster.*, grafanaCloudAPM.enabled values.
Agent (gated on agent.enabled, default true) — Deployment + ConfigMap (no Service; agent is outbound-only). Scrapes Prometheus targets in the cluster and forwards via OTLP to the gateway. Built from agent.scrape.* toggles; the gateway endpoint is auto-derived from the chart’s own fullname helpers, so no manual wiring.

SMK deploys via peripheries.otel-collector: true, which produces one otel-collector release with both workloads. SaaS migrates to a similar one-release pattern with the same chart — set gateway.config if SaaS needs to keep its hand-written gateway config.

The SMK values overlay lives at tools/kindo-cli/deploy/values/otel-collector.yaml.gotmpl. It reads Grafana credentials from the cluster-resident secrets file. Only the Hatchet scrape is on by default in the agent (always in-cluster on SMK); RabbitMQ, Redis, ingress-nginx, and the K8s infra scrapes (cadvisor, kubelet, kube-state-metrics, node-exporter) are off — customers opt in by overriding the relevant agent.scrape.*.enabled flag. Static-target scrapes with enabled: true but an empty target are skipped silently at render time.

gateway.config and agent.config are per-workload escape hatches: when non-empty, the chart’s structured builder is skipped and that value is used as the collector config verbatim.

Troubleshooting

Symptom	Cause	Solution
No traces	SDK disabled	Check `OTEL_SDK_DISABLED` is not `true`
No traces	Wrong endpoint	Verify `OTEL_EXPORTER_OTLP_ENDPOINT` resolves and is reachable from the pod
Disconnected traces	Context not propagated	Ensure the W3C `traceparent` header passes between services
No metrics in Grafana	Exporter disabled	Set `OTEL_METRICS_EXPORTER=otlp`
OOM errors	Large span queue	Reduce `OTEL_BSP_MAX_QUEUE_SIZE`
No trace IDs in logs	Wrong logger	Use `@kindo/observability` logger
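
For the "wrong endpoint" case, a quick TCP check run from inside the affected pod can rule out DNS and network problems. The helper below is a standard-library sketch with a hypothetical name; it assumes port 4317 when the URL omits one, and a successful connection only shows that the port accepts connections, not that OTLP data is being accepted.

```python
import socket
from urllib.parse import urlsplit


def otlp_endpoint_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(endpoint)
    if not parts.hostname:
        return False  # not a usable URL
    port = parts.port or 4317  # assumed default when no port is given
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, otlp_endpoint_reachable("http://otel-collector-gateway.kindo-monitoring:4317") returning False from an app pod points at service discovery or network policy rather than SDK configuration.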

Quick Reference

In an SMK install where peripheries.otel-collector: true, the gateway runs in the kindo-monitoring namespace and kindo-cli injects the OTel SDK env vars on every Kindo app pod automatically — you don’t need to set these by hand. The exact set injected (see _otel_env in tools/kindo-cli/src/kindo_cli/config/secret_data.py):

OTEL_SDK_DISABLED=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_METRIC_EXPORT_INTERVAL=60000
OTEL_BSP_MAX_QUEUE_SIZE=4096
OTEL_BSP_SCHEDULE_DELAY=1000
OTEL_SERVICE_NAME=<service-name>
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=<env>,service.namespace=kindo

When peripheries.otel-collector: false, kindo-cli injects the kill-switch and exporter-disable set instead:

OTEL_SDK_DISABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=
OTEL_METRICS_EXPORTER=none
OTEL_TRACES_EXPORTER=none
OTEL_LOGS_EXPORTER=none
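
The two sets above can be summarized as a single function of the install-contract flag. The sketch below is illustrative only: the real logic lives in _otel_env, and the function name and signature here are assumptions rather than a copy of that code.

```python
GATEWAY_ENDPOINT = "http://otel-collector-gateway.kindo-monitoring:4317"


def otel_env_sketch(collector_enabled: bool, service_name: str, environment: str) -> dict:
    """Return the OTel env vars for one app pod, mirroring the two lists above."""
    if not collector_enabled:
        # Kill switch plus disabled exporters.
        return {
            "OTEL_SDK_DISABLED": "true",
            "OTEL_EXPORTER_OTLP_ENDPOINT": "",
            "OTEL_METRICS_EXPORTER": "none",
            "OTEL_TRACES_EXPORTER": "none",
            "OTEL_LOGS_EXPORTER": "none",
        }
    return {
        "OTEL_SDK_DISABLED": "false",
        "OTEL_EXPORTER_OTLP_ENDPOINT": GATEWAY_ENDPOINT,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_LOGS_EXPORTER": "otlp",
        "OTEL_METRIC_EXPORT_INTERVAL": "60000",
        "OTEL_BSP_MAX_QUEUE_SIZE": "4096",
        "OTEL_BSP_SCHEDULE_DELAY": "1000",
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": (
            f"deployment.environment={environment},service.namespace=kindo"
        ),
    }
```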

If you have custom apps that need to ship to the same gateway, set OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=grpc on them too.