Observability Guide

Kindo uses OpenTelemetry (OTEL) for unified observability across all services, covering traces, metrics, and logs.

Architecture

+-----------------------------------------------------------+
|                      Kindo Services                       |
+--------+--------+--------+--------+--------+--------------+
|  API   |  Task  |  Ext.  |Credits | LiteLLM|   Next.js    |
|        | Worker |  Sync  |        |(Python)|              |
+---+----+---+----+---+----+---+----+---+----+---+----------+
    |        |        |        |        |        |
    +--------+--------+--------+--------+--------+
                      | OTLP
               +------v------+      +------------+
               |   Gateway   |<-----+   Agent    |
               | (OTel coll) | OTLP |(Prom scrape|
               +------+------+      |  → gateway)|
                      |             +------------+
                      |
                      v
             +--------+--------+
             | Grafana / Tempo |
             |     backend     |
             +-----------------+

Gateway and Agent are two workloads of the same backend/otel-collector chart, deployed as a single Helm release in the kindo-monitoring namespace. Apps emit OTLP to the gateway; the agent scrapes Prometheus targets across the cluster and forwards the scraped metrics via OTLP to the same gateway.

Key Principles

  1. Auto-Instrumentation — Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation.
  2. Custom Instrumentation — Business-critical operations are manually instrumented with custom spans and metrics.
  3. Unified Collection — All telemetry flows through the OTEL Collector for processing and export.

Environment Variables

Core OTEL Configuration

| Variable | Description | Required |
| --- | --- | --- |
| OTEL_SERVICE_NAME | Unique identifier for the service | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol (http/protobuf or grpc) | No |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | No |
| OTEL_METRICS_EXPORTER | Metrics exporter (otlp, prometheus, none) | No |
| OTEL_TRACES_EXPORTER | Traces exporter (otlp, none) | No |
| OTEL_LOGS_EXPORTER | Logs exporter (otlp, none) | No |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval (ms) | No |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay (ms) | No |
| OTEL_SDK_DISABLED | Disable OTEL SDK entirely | No |

Trace Sampling

| Variable | Description | Example |
| --- | --- | --- |
| OTEL_TRACES_SAMPLER | Sampler type | always_on, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (10% sampling) |

Service-Specific Configuration

API Service

OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Auto-instruments HTTP, Express, Prisma, Redis, and RabbitMQ. Includes custom Hatchet instrumentation and Winston log correlation.

Task Worker

OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Custom Hatchet SDK instrumentation for distributed workflow tracing. Propagates W3C traceparent through workflow metadata.
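
As an illustration of what travels in the workflow metadata, here is a minimal sketch of formatting and parsing a W3C traceparent value. It is written in Python for brevity and is not the task worker's actual code; the function names and the "traceparent" metadata key are assumptions made for this example.

```python
import re
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed fields, or None when the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def inject(metadata: dict, trace_id: str, span_id: str) -> dict:
    """Hypothetical helper: copy the metadata and attach a traceparent key."""
    return {**metadata, "traceparent": build_traceparent(trace_id, span_id)}
```

A downstream step would call parse_traceparent on the metadata value and start its span as a child of the returned trace_id and span_id; a None result means the step starts a new trace.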

LiteLLM

OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Python-based service instrumented with OpenTelemetry Python. Includes a custom metrics guardrail and dynamic metrics toggled via Unleash feature flags.

Other Services

All other services (external-sync, credits, audit-log-exporter, external-poller, Next.js) follow the same pattern with their respective OTEL_SERVICE_NAME.

Custom Metrics

Application Metrics

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| kindo_merge_download_total | Counter | Merge download requests | |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Duplicate workflow transaction latency | |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts state conversions | |

Grafana Dashboard Metrics

| Metric | Description |
| --- | --- |
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption |
| kindo_ingestion_bytes_total | Bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing |

Auto-Instrumentation Coverage

| Library/Framework | What’s Traced |
| --- | --- |
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |

Hatchet Workflow Spans

Hatchet workflows are instrumented via HatchetInstrumentor from @hatchet-dev/typescript-sdk/opentelemetry, which ships with the SDK (v1.18.0+). It traces workflow creation, execution, task completion, event publishing, cron scheduling, and admin operations. The previous homegrown @kindo/instrumentation-hatchet package was removed once the upstream instrumentation reached feature parity.

| Span Attribute | Description |
| --- | --- |
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique run identifier |
| hatchet.step_name | Current step name |
| hatchet.task_type | durable or non_durable |
| hatchet.task_duration | Execution time (ms) |

Durable Chat Spans

The task-worker-ts service adds durable chat domain spans inside the Hatchet workflow shell. These spans use durable_chat.* names for setup, LLM calls, tool execution, step commits, success finalization, and failure finalization.

| Span Name | Description |
| --- | --- |
| durable_chat.setup | Durable chat setup and persisted conversation boundary loading |
| durable_chat.llm_call | Model request and response streaming work |
| durable_chat.tool_execution | Tool execution child work |
| durable_chat.commit | Step commit publication after a successful child result |
| durable_chat.finalize_success | Successful assistant message finalization |
| durable_chat.finalize_failure | Terminal failure persistence |

Snake_case attributes are canonical for durable chat trace queries. Important attributes include conversation_id, message_id, task_name, durable_chat_step_kind, logical_step_id, workflow_run_id, current_workflow_run_id, parent_workflow_run_id, attempt_id, and agent_run_id. The failure-finalization span sets durable_chat_step_kind=finalize_failure and durable_chat_failure=true.

Some traces also include camelCase aliases such as conversationId, messageId, and workflowRunId for compatibility with legacy Tempo queries. Prefer the snake_case attributes in new dashboards and runbooks.
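
When post-processing exported spans (for example in a runbook script), the aliases can be folded into their canonical keys. The following is a minimal sketch under that assumption; the helper names are hypothetical and nothing here is part of the Kindo codebase.

```python
import re

# Matches the position between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(key: str) -> str:
    """conversationId -> conversation_id; snake_case keys pass through unchanged."""
    return _CAMEL_BOUNDARY.sub("_", key).lower()


def canonical_attributes(attributes: dict) -> dict:
    """Key attributes by snake_case; an existing snake_case key wins over its alias."""
    result = {}
    for key, value in attributes.items():
        snake = to_snake_case(key)
        # Keep canonical keys as-is; keep an alias only when no canonical key exists.
        if snake == key or snake not in attributes:
            result[snake] = value
    return result
```

For instance, a span carrying both conversationId and conversation_id collapses to the single conversation_id entry, while a lone messageId is renamed to message_id.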

Logging

Log Format

All backend services use structured JSON logging with trace correlation:

{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "user_id": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}

Use snake_case for custom structured log fields and custom telemetry label keys added by application code. For example, prefer user_id, workflow_id, and operation_name over userId, workflowId, and operationName.
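
To make the log shape concrete, here is a minimal sketch of a formatter that emits it using Python's standard logging module. It is illustrative only: the Node.js services use Winston, and the class name and the convention of passing trace_id, span_id, and fields through extra= are assumptions of this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object with trace correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Assumed to arrive via `extra=`; None means the log is uncorrelated.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace(
                "+00:00", "Z"
            ),
        }
        # Custom fields travel in one dict whose keys should be snake_case.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)
```

A caller would attach the formatter to a handler and log with, for example, logger.info("Request processed", extra={"trace_id": ..., "span_id": ..., "fields": {"user_id": "user-1"}}).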

Sensitive Headers

These headers are automatically redacted from logs: authorization, cookie, x-api-key, x-connection-config, x-kindo-metadata.
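
The redaction rule can be sketched as a case-insensitive lookup against that header list. This is a minimal illustration, not the logger's actual implementation; the function name and the "[REDACTED]" placeholder are assumptions.

```python
# Header names taken from the list above, stored in lowercase.
SENSITIVE_HEADERS = frozenset(
    {"authorization", "cookie", "x-api-key", "x-connection-config", "x-kindo-metadata"}
)


def redact_headers(headers: dict, placeholder: str = "[REDACTED]") -> dict:
    """Return a copy with sensitive values masked; names match case-insensitively."""
    return {
        name: placeholder if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```

Returning a copy rather than mutating the input keeps the original headers usable for the request itself.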

Best Practices

Service Naming

Use consistent, lowercase, hyphenated names: api, task-worker-ts, external-sync, credits, litellm, audit-log-exporter, external-poller.

Resource Attributes

Include deployment context:

OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
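
The value is a comma-separated list of key=value pairs. A minimal sketch of how such a string breaks down, assuming percent-encoded values as in the OpenTelemetry environment variable specification (the function name is hypothetical, and real SDKs do this parsing for you):

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Split a key=value,key=value string into a dict, skipping malformed pairs."""
    attributes = {}
    for pair in raw.split(","):
        key, separator, value = pair.partition("=")
        if not separator or not key.strip():
            continue  # no "=" or empty key: ignore the entry
        attributes[key.strip()] = unquote(value.strip())
    return attributes
```

This is mainly useful for sanity-checking a value in a deploy script before it reaches the pods.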

Sampling

For high-traffic services:

OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
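
Ratio sampling is deterministic per trace: the decision is derived from the trace ID, so every service that sees the same trace reaches the same answer. The sketch below shows the general idea, comparing the low 64 bits of the trace ID against a threshold. It is a simplified illustration, and the exact algorithm in a given SDK may differ.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the low 64 bits of a 128-bit hex trace ID whether to sample."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Sample when the ID falls below the ratio's share of the 64-bit range.
    return low_bits < round(ratio * (1 << 64))
```

With a ratio of 0.1, roughly one in ten randomly generated trace IDs falls under the threshold.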

Metric Label Cardinality

Avoid user IDs, request IDs, or session IDs as metric labels. Use span attributes for high-cardinality data. Stick to fixed, enumerable values. Use snake_case for custom metric label keys.
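
These rules lend themselves to a small lint check in tests or code review tooling. A minimal sketch, assuming a hypothetical denylist of unbounded keys; neither the function nor the denylist exists in the Kindo codebase.

```python
import re

_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Assumed denylist: keys whose values grow without bound.
HIGH_CARDINALITY_KEYS = frozenset({"user_id", "request_id", "session_id"})


def check_metric_labels(labels: dict) -> list:
    """Return a list of human-readable problems; an empty list means the labels pass."""
    problems = []
    for key in labels:
        if not _SNAKE_CASE.match(key):
            problems.append(f"{key}: not snake_case")
        elif key in HIGH_CARDINALITY_KEYS:
            problems.append(f"{key}: high-cardinality, use a span attribute instead")
    return problems
```

The labels on kindo_external_sync_message_processed_total (topic, event, result) pass this check, since each is snake_case and takes values from a fixed set.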

Self-managed OTel Collector (SMK)

Two orthogonal gates control observability:

  1. peripheries.otel-collector (install contract) — whether the collector exists. When on, a single otel-collector release deploys to the kindo-monitoring namespace with two workloads: an otel-collector-gateway Deployment (receives OTLP from apps) and an otel-collector-agent Deployment (scrapes infrastructure Prometheus targets and forwards to the gateway). Every app’s OTel SDK flips on too — apps emit OTLP to otel-collector-gateway.kindo-monitoring:4317.
  2. observability.grafana.* (environment bindings) — whether the collector ships out to Grafana. When the credentials are populated, the collector exports to Grafana’s Prometheus remote-write and OTLP endpoints. When empty, the collector still runs but exports only via debug (visible in the pod logs).

You can therefore enable the collector with no backend (useful for verifying app emission before wiring up storage), or skip the collector entirely (peripheries.otel-collector: false). The chart’s default config builder only knows about the Grafana exporters today — pointing the collector at a non-Grafana backend means supplying a hand-written gateway.config via the chart’s escape hatch, not flipping a value.

Wiring up Grafana

Add these under observability.grafana in environment-bindings.yaml:

environment-bindings.yaml
observability:
  grafana:
    prometheusRemoteWriteEndpoint: "https://prometheus-prod-XX.grafana.net/api/prom/push"
    prometheusRemoteWriteUsername: "<grafana cloud datasource id>"
    otlpEndpoint: "https://otlp-gateway-prod-XX.grafana.net/otlp"
    otlpUsername: "<grafana cloud OTLP datasource id>"
    authPassword: "<grafana cloud API token>"
    clusterLabel: "<your cluster identifier>"

clusterLabel is shipped as the cluster Prometheus external label on every metric and is the value you’d use to group/filter multi-cluster dashboards. Use a stable per-cluster identifier (e.g. acme-prod, acme-staging).

The CLI projects this block into the cluster-resident secrets file as top-level grafana.* (so the helmfile values overlay can read them directly). If you populate the secrets file by hand via kindo config edit, use the top-level grafana: form to match.

All six fields are validated as an all-or-nothing set: partial population (e.g. both endpoints and both usernames filled in but authPassword empty) is rejected at config-load time by GrafanaBindings.validate_grafana_completeness. Either fill them all in or leave them all empty.
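
The all-or-nothing rule can be sketched as follows. This is an illustrative stand-in and not the actual GrafanaBindings.validate_grafana_completeness; the function name, return convention, and error message are assumptions.

```python
GRAFANA_FIELDS = (
    "prometheusRemoteWriteEndpoint",
    "prometheusRemoteWriteUsername",
    "otlpEndpoint",
    "otlpUsername",
    "authPassword",
    "clusterLabel",
)


def check_grafana_completeness(grafana: dict) -> bool:
    """Return True when all six fields are set, False when none are; raise on a mix."""
    filled = [name for name in GRAFANA_FIELDS if str(grafana.get(name) or "").strip()]
    if not filled:
        return False  # nothing configured: collector exports via debug only
    missing = [name for name in GRAFANA_FIELDS if name not in filled]
    if missing:
        raise ValueError(
            "incomplete observability.grafana block, missing: " + ", ".join(missing)
        )
    return True
```

Failing at load time, rather than at export time, surfaces a half-filled block before the collector is deployed with credentials that cannot work.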

Chart shape

backend/otel-collector/helm_chart is a single chart that renders both workloads in one release:

  • Gateway (always) — Deployment + Service + ConfigMap (and an HPA when gateway.autoscaling.enabled: true). Receives OTLP from apps and exports to a backend. Built from grafana.*, cluster.*, grafanaCloudAPM.enabled values.
  • Agent (gated on agent.enabled, default true) — Deployment + ConfigMap (no Service; agent is outbound-only). Scrapes Prometheus targets in the cluster and forwards via OTLP to the gateway. Built from agent.scrape.* toggles; the gateway endpoint is auto-derived from the chart’s own fullname helpers, so no manual wiring.

SMK deploys via peripheries.otel-collector: true, which produces one otel-collector release with both workloads. SaaS migrates to a similar one-release pattern with the same chart — set gateway.config if SaaS needs to keep its hand-written gateway config.

The SMK values overlay lives at tools/kindo-cli/deploy/values/otel-collector.yaml.gotmpl. It reads Grafana credentials from the cluster-resident secrets file. Only the Hatchet scrape is on by default in the agent (always in-cluster on SMK); RabbitMQ, Redis, ingress-nginx, and the K8s infra scrapes (cadvisor, kubelet, kube-state-metrics, node-exporter) are off — customers opt in by overriding the relevant agent.scrape.*.enabled flag. Static-target scrapes with enabled: true but an empty target are skipped silently at render time.

gateway.config and agent.config are per-workload escape hatches: when non-empty, the chart’s structured builder is skipped and that value is used as the collector config verbatim.
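
The precedence rule amounts to a simple either/or. A minimal sketch in Python, with hypothetical names; the chart itself implements this in Helm templates, not in Python.

```python
from typing import Callable


def resolve_collector_config(override: str, build_structured: Callable[[], str]) -> str:
    """Use the override verbatim when non-empty; otherwise call the structured builder."""
    if override:
        return override  # escape hatch: the builder is never invoked
    return build_structured()
```

Because the override replaces the structured output wholesale, any grafana.* or agent.scrape.* values have no effect on a workload whose config is set this way.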

Troubleshooting

| Symptom | Cause | Solution |
| --- | --- | --- |
| No traces | SDK disabled | Check OTEL_SDK_DISABLED is not true |
| No traces | Wrong endpoint | Verify endpoint is reachable |
| Disconnected traces | Context not propagated | Ensure trace headers pass between services |
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| No trace IDs in logs | Wrong logger | Use @kindo/observability logger |
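
For the "Wrong endpoint" row, a quick TCP check from inside an app pod can rule out DNS and network problems. A minimal sketch using only the standard library; the function name is hypothetical. Note that this only confirms the port accepts connections, not that the collector accepts OTLP.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For the gateway described in this guide, the call would be endpoint_reachable("otel-collector-gateway.kindo-monitoring", 4317).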

Quick Reference

In an SMK install where peripheries.otel-collector: true, the gateway runs in the kindo-monitoring namespace and kindo-cli injects the OTel SDK env vars on every Kindo app pod automatically — you don’t need to set these by hand. The exact set injected (see _otel_env in tools/kindo-cli/src/kindo_cli/config/secret_data.py):

OTEL_SDK_DISABLED=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_METRIC_EXPORT_INTERVAL=60000
OTEL_BSP_MAX_QUEUE_SIZE=4096
OTEL_BSP_SCHEDULE_DELAY=1000
OTEL_SERVICE_NAME=<service-name>
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=<env>,service.namespace=kindo

When peripheries.otel-collector: false, kindo-cli injects the kill-switch and exporter-disable set instead:

OTEL_SDK_DISABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=
OTEL_METRICS_EXPORTER=none
OTEL_TRACES_EXPORTER=none
OTEL_LOGS_EXPORTER=none
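
The two sets above can be read as one function of the peripheries.otel-collector flag. The sketch below is a hypothetical stand-in that mirrors the listed values; it is not the actual _otel_env from kindo-cli, and the function signature is an assumption.

```python
GATEWAY_ENDPOINT = "http://otel-collector-gateway.kindo-monitoring:4317"


def otel_env(collector_enabled: bool, service_name: str, environment: str) -> dict:
    """Build the OTel SDK env vars for one app pod (illustrative only)."""
    if not collector_enabled:
        # Kill switch plus disabled exporters, so nothing tries to export.
        return {
            "OTEL_SDK_DISABLED": "true",
            "OTEL_EXPORTER_OTLP_ENDPOINT": "",
            "OTEL_METRICS_EXPORTER": "none",
            "OTEL_TRACES_EXPORTER": "none",
            "OTEL_LOGS_EXPORTER": "none",
        }
    return {
        "OTEL_SDK_DISABLED": "false",
        "OTEL_EXPORTER_OTLP_ENDPOINT": GATEWAY_ENDPOINT,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_LOGS_EXPORTER": "otlp",
        "OTEL_METRIC_EXPORT_INTERVAL": "60000",
        "OTEL_BSP_MAX_QUEUE_SIZE": "4096",
        "OTEL_BSP_SCHEDULE_DELAY": "1000",
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": (
            f"deployment.environment={environment},service.namespace=kindo"
        ),
    }
```
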

If you have custom apps that need to ship to the same gateway, set OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=grpc on them too.