Observability Guide
Kindo uses OpenTelemetry (OTEL) for unified observability across all services, covering traces, metrics, and logs.
Architecture

    +-----------------------------------------------------------+
    |                      Kindo Services                       |
    +--------+--------+--------+--------+--------+--------------+
    |  API   |  Task  |  Ext.  |Credits | LiteLLM|   Next.js    |
    |        | Worker |  Sync  |        |(Python)|              |
    +---+----+---+----+---+----+---+----+---+----+---+----------+
        |        |        |        |        |        |
        +--------+--------+--------+--------+--------+
                          | OTLP
                   +------v------+      +------------+
                   |   Gateway   |<-----+   Agent    |
                   | (OTel coll) |  OTLP|(Prom scrape|
                   +------+------+      | → gateway) |
                          |             +------------+
                          v
                 +--------+--------+
                 | Grafana / Tempo |
                 |     backend     |
                 +-----------------+

Gateway and Agent are two workloads of the same backend/otel-collector chart, deployed as a single Helm release in the kindo-monitoring namespace. Apps emit OTLP to the gateway; the agent scrapes Prometheus targets across the cluster and forwards them via OTLP to the same gateway.
Key Principles
- Auto-Instrumentation: Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation.
- Custom Instrumentation: Business-critical operations are manually instrumented with custom spans and metrics.
- Unified Collection: All telemetry flows through the OTEL Collector for processing and export.
Environment Variables
Core OTEL Configuration
| Variable | Description | Required |
|---|---|---|
| OTEL_SERVICE_NAME | Unique identifier for the service | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol (http/protobuf or grpc) | No |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | No |
| OTEL_METRICS_EXPORTER | Metrics exporter (otlp, prometheus, none) | No |
| OTEL_TRACES_EXPORTER | Traces exporter (otlp, none) | No |
| OTEL_LOGS_EXPORTER | Logs exporter (otlp, none) | No |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval (ms) | No |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay (ms) | No |
| OTEL_SDK_DISABLED | Disable OTEL SDK entirely | No |
Trace Sampling
| Variable | Description | Example |
|---|---|---|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (10% sampling) |
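To make the effect of traceidratio concrete, here is a minimal sketch of a ratio-based sampling decision. It assumes the common approach of comparing the low 64 bits of the trace ID against a bound derived from the ratio; the function name should_sample is illustrative, and real SDK samplers may differ in detail.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide whether to keep a trace, based only on its ID and the ratio.

    Illustrative sketch, not an SDK API. The decision is deterministic
    per trace ID, so every service that sees the same trace ID and the
    same ratio reaches the same answer.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Upper bound on the low 64 bits of the trace ID for sampled traces.
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound
```

With OTEL_TRACES_SAMPLER_ARG=0.1, roughly one trace in ten falls under the bound. Because the decision depends only on the trace ID, a trace is kept or dropped as a whole rather than span by span.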
Service-Specific Configuration
API Service

    OTEL_SERVICE_NAME=api
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc
    OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Auto-instruments HTTP, Express, Prisma, Redis, and RabbitMQ. Includes custom Hatchet instrumentation and Winston log correlation.
Task Worker

    OTEL_SERVICE_NAME=task-worker-ts
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Custom Hatchet SDK instrumentation for distributed workflow tracing. Propagates W3C traceparent through workflow metadata.
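As an illustration of what propagating a W3C traceparent through metadata involves, the sketch below builds and parses a version-00 traceparent value. The metadata key name traceparent and both helper names are assumptions for this example; the actual Hatchet SDK integration is not shown here.

```python
import re
from typing import Optional, Tuple

# Version-00 traceparent: version, 32-hex trace ID, 16-hex span ID, 2-hex flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def inject_traceparent(metadata: dict, trace_id: int, span_id: int, sampled: bool) -> dict:
    """Return a copy of the metadata with a traceparent entry added (hypothetical helper)."""
    flags = "01" if sampled else "00"
    out = dict(metadata)
    out["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"
    return out


def extract_traceparent(metadata: dict) -> Optional[Tuple[int, int, bool]]:
    """Parse (trace_id, span_id, sampled) from metadata, or None if absent or invalid."""
    match = _TRACEPARENT_RE.match(metadata.get("traceparent", ""))
    if match is None:
        return None
    version, trace_hex, span_hex, flags = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == 0 or span_id == 0:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

The worker side would call something like extract_traceparent on the incoming workflow metadata and start its span as a child of the returned IDs, which is what keeps the API request and the workflow run in one trace.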
LiteLLM

    OTEL_SERVICE_NAME=litellm
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Python-based service with OpenTelemetry Python instrumentation. Includes custom metrics guardrail and dynamic metrics via Unleash feature flags.
Other Services
All other services (external-sync, credits, audit-log-exporter, external-poller, Next.js) follow the same pattern with their respective OTEL_SERVICE_NAME.
Custom Metrics
Application Metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| kindo_merge_download_total | Counter | Merge download requests | none |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Duplicate workflow transaction latency | none |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | none |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts state conversions | none |
Grafana Dashboard Metrics
| Metric | Description |
|---|---|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption |
| kindo_ingestion_bytes_total | Bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing |
Auto-Instrumentation Coverage
| Library/Framework | What’s Traced |
|---|---|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |
Hatchet Workflow Spans
Hatchet workflows are instrumented via HatchetInstrumentor from @hatchet-dev/typescript-sdk/opentelemetry, which ships with the SDK (v1.18.0+). It traces workflow creation, execution, task completion, event publishing, cron scheduling, and admin operations. The previous homegrown @kindo/instrumentation-hatchet package was removed once the upstream instrumentation shipped feature parity.
| Span Attribute | Description |
|---|---|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique run identifier |
| hatchet.step_name | Current step name |
| hatchet.task_type | durable or non_durable |
| hatchet.task_duration | Execution time (ms) |
Durable Chat Spans
The task-worker-ts service adds durable chat domain spans inside the Hatchet
workflow shell. These spans use durable_chat.* names for setup, LLM calls,
tool execution, step commits, success finalization, and failure finalization.
| Span Name | Description |
|---|---|
| durable_chat.setup | Durable chat setup and persisted conversation boundary loading |
| durable_chat.llm_call | Model request and response streaming work |
| durable_chat.tool_execution | Tool execution child work |
| durable_chat.commit | Step commit publication after a successful child result |
| durable_chat.finalize_success | Successful assistant message finalization |
| durable_chat.finalize_failure | Terminal failure persistence |
Snake_case attributes are canonical for durable chat trace queries. Important
attributes include conversation_id, message_id, task_name,
durable_chat_step_kind, logical_step_id, workflow_run_id,
current_workflow_run_id, parent_workflow_run_id, attempt_id, and
agent_run_id. The failure-finalization span sets
durable_chat_step_kind=finalize_failure and
durable_chat_failure=true.
Some traces also include camelCase aliases such as conversationId,
messageId, and workflowRunId for compatibility with legacy Tempo queries.
Prefer the snake_case attributes in new dashboards and runbooks.
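When post-processing exported spans (for example in a script that audits dashboards), the camelCase aliases can be folded into their canonical names. The sketch below is one possible way to do that; canonical_attributes and to_snake_case are hypothetical helpers, not part of any Kindo package.

```python
import re

# Matches the position between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(key: str) -> str:
    """Convert a camelCase key such as workflowRunId to workflow_run_id."""
    return _CAMEL_BOUNDARY.sub("_", key).lower()


def canonical_attributes(attributes: dict) -> dict:
    """Return attributes keyed by snake_case names only.

    When both a snake_case key and its camelCase alias are present,
    the snake_case value wins, since it is the canonical one.
    """
    canonical = {}
    # First pass: keep keys that are already canonical.
    for key, value in attributes.items():
        if to_snake_case(key) == key:
            canonical[key] = value
    # Second pass: fill in aliases only where no canonical key exists.
    for key, value in attributes.items():
        canonical.setdefault(to_snake_case(key), value)
    return canonical
```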
Logging
Log Format
All backend services use structured JSON logging with trace correlation:

    {
      "level": "info",
      "message": "Request processed",
      "trace_id": "abc123def456...",
      "span_id": "789xyz...",
      "user_id": "user-1",
      "timestamp": "2024-01-15T10:30:00.000Z"
    }

Use snake_case for custom structured log fields and custom telemetry label keys added by application code. For example, prefer user_id, workflow_id, and operation_name over userId, workflowId, and operationName.
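For a Python service, the log shape above could be produced with a small logging formatter along these lines. This is a sketch under stated assumptions: the class name JsonTraceFormatter and the convention of passing trace_id, span_id, and a fields dict through the extra argument are invented for this example, and the Node.js services use their own logger instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Render log records as JSON with trace correlation fields (illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # trace_id / span_id are expected via logging's `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        }
        # Custom structured fields should use snake_case keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)
```

A call such as logger.info("Request processed", extra={"trace_id": "...", "span_id": "...", "fields": {"user_id": "user-1"}}) would then emit one JSON object per line.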
Sensitive Headers
These headers are automatically redacted from logs: authorization, cookie, x-api-key, x-connection-config, x-kindo-metadata.
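The redaction rule can be pictured as a simple filter over the header map. The sketch below assumes case-insensitive header matching and a "[REDACTED]" placeholder; both are assumptions for illustration, and the real logger may behave differently.

```python
# Header names listed in this guide as automatically redacted.
SENSITIVE_HEADERS = frozenset({
    "authorization",
    "cookie",
    "x-api-key",
    "x-connection-config",
    "x-kindo-metadata",
})


def redact_headers(headers: dict, placeholder: str = "[REDACTED]") -> dict:
    """Return a copy of headers with sensitive values replaced (hypothetical helper)."""
    return {
        name: (placeholder if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```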
Best Practices
Service Naming
Use consistent, lowercase, hyphenated names: api, task-worker-ts, external-sync, credits, litellm, audit-log-exporter, external-poller.
Resource Attributes
Include deployment context:

    OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"

Sampling
For high-traffic services:

    OTEL_TRACES_SAMPLER=traceidratio
    OTEL_TRACES_SAMPLER_ARG=0.1

Metric Label Cardinality
Avoid user IDs, request IDs, or session IDs as metric labels. Use span attributes for high-cardinality data. Stick to fixed, enumerable values.
Use snake_case for custom metric label keys.
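These two rules lend themselves to a lint-style check in tests or code review tooling. The following is a minimal sketch; the function name and the exact list of forbidden keys are assumptions based on the examples named above.

```python
import re

_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Keys named in this guide as unsuitable for metric labels (assumed spelling).
HIGH_CARDINALITY_KEYS = frozenset({"user_id", "request_id", "session_id"})


def check_metric_labels(labels: dict) -> list:
    """Return a list of human-readable problems with the given label set."""
    problems = []
    for key in labels:
        if not _SNAKE_CASE.match(key):
            problems.append(f"{key}: label key is not snake_case")
        elif key in HIGH_CARDINALITY_KEYS:
            problems.append(f"{key}: high-cardinality value, use a span attribute instead")
    return problems
```

For instance, the label set topic, event, result used by kindo_external_sync_message_processed_total passes, while a userId or user_id label is flagged.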
Self-managed OTel Collector (SMK)
Two orthogonal gates control observability:
- peripheries.otel-collector (install contract): whether the collector exists. When on, a single otel-collector release deploys to the kindo-monitoring namespace with two workloads: an otel-collector-gateway Deployment (receives OTLP from apps) and an otel-collector-agent Deployment (scrapes infrastructure Prometheus targets and forwards to the gateway). Every app’s OTel SDK is enabled too, and apps emit OTLP to otel-collector-gateway.kindo-monitoring:4317.
- observability.grafana.* (environment bindings): whether the collector ships out to Grafana. When the credentials are populated, the collector exports to Grafana’s Prometheus remote-write and OTLP endpoints. When empty, the collector still runs but exports only via debug (visible in the pod logs).
You can therefore enable the collector with no backend (useful for verifying app emission before wiring up storage), or skip the collector entirely (peripheries.otel-collector: false). The chart’s default config builder only knows about the Grafana exporters today — pointing the collector at a non-Grafana backend means supplying a hand-written gateway.config via the chart’s escape hatch, not flipping a value.
Wiring up Grafana
Add these under observability.grafana in environment-bindings.yaml:

    observability:
      grafana:
        prometheusRemoteWriteEndpoint: "https://prometheus-prod-XX.grafana.net/api/prom/push"
        prometheusRemoteWriteUsername: "<grafana cloud datasource id>"
        otlpEndpoint: "https://otlp-gateway-prod-XX.grafana.net/otlp"
        otlpUsername: "<grafana cloud OTLP datasource id>"
        authPassword: "<grafana cloud API token>"
        clusterLabel: "<your cluster identifier>"

clusterLabel is shipped as the cluster Prometheus external label on every metric and is the value you’d use to group or filter multi-cluster dashboards. Use a stable per-cluster identifier (e.g. acme-prod, acme-staging).
The CLI projects this block into the cluster-resident secrets file as top-level grafana.* (so the helmfile values overlay can read them directly). If you populate the secrets file by hand via kindo config edit, use the top-level grafana: form to match.
All six fields are validated as an all-or-nothing set: partial population (e.g. the endpoints and usernames filled in but authPassword empty) is rejected at config-load time by GrafanaBindings.validate_grafana_completeness. Either fill them all in or leave them all empty.
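The all-or-nothing rule can be sketched as follows. This is not the actual GrafanaBindings.validate_grafana_completeness implementation, only an illustration of the behaviour described; the function name, the plain-dict input, and the error message are assumptions.

```python
GRAFANA_FIELDS = (
    "prometheusRemoteWriteEndpoint",
    "prometheusRemoteWriteUsername",
    "otlpEndpoint",
    "otlpUsername",
    "authPassword",
    "clusterLabel",
)


def check_grafana_completeness(bindings: dict) -> None:
    """Raise ValueError if some, but not all, Grafana fields are populated."""
    filled = [f for f in GRAFANA_FIELDS if str(bindings.get(f) or "").strip()]
    if filled and len(filled) != len(GRAFANA_FIELDS):
        missing = [f for f in GRAFANA_FIELDS if f not in filled]
        raise ValueError(
            "observability.grafana is partially populated; missing: "
            + ", ".join(missing)
        )
```

An empty block and a fully populated block both pass; anything in between fails early, which avoids a collector that starts with half-configured exporters.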
Chart shape
backend/otel-collector/helm_chart is a single chart that renders both workloads in one release:
- Gateway (always): Deployment + Service + ConfigMap (and an HPA when gateway.autoscaling.enabled: true). Receives OTLP from apps and exports to a backend. Built from grafana.*, cluster.*, grafanaCloudAPM.enabled values.
- Agent (gated on agent.enabled, default true): Deployment + ConfigMap (no Service; agent is outbound-only). Scrapes Prometheus targets in the cluster and forwards via OTLP to the gateway. Built from agent.scrape.* toggles; the gateway endpoint is auto-derived from the chart’s own fullname helpers, so no manual wiring is needed.
SMK deploys via peripheries.otel-collector: true, which produces one otel-collector release with both workloads. SaaS migrates to a similar one-release pattern with the same chart — set gateway.config if SaaS needs to keep its hand-written gateway config.
The SMK values overlay lives at tools/kindo-cli/deploy/values/otel-collector.yaml.gotmpl. It reads Grafana credentials from the cluster-resident secrets file. Only the Hatchet scrape is on by default in the agent (always in-cluster on SMK); RabbitMQ, Redis, ingress-nginx, and the K8s infra scrapes (cadvisor, kubelet, kube-state-metrics, node-exporter) are off — customers opt in by overriding the relevant agent.scrape.*.enabled flag. Static-target scrapes with enabled: true but an empty target are skipped silently at render time.
gateway.config and agent.config are per-workload escape hatches: when non-empty, the chart’s structured builder is skipped and that value is used as the collector config verbatim.
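The selection logic described above (escape hatch first, structured builder otherwise, and silent skipping of enabled scrapes with no target) can be modelled in a few lines. The chart itself implements this in Helm templates; the Python below is only a model of the behaviour, and the function names and the enabled/target dictionary shape are assumptions.

```python
from typing import Callable


def resolve_collector_config(override: str, build_default: Callable[[], str]) -> str:
    """Use the override verbatim when it is non-empty, else run the structured builder."""
    if override:
        return override
    return build_default()


def active_static_scrapes(scrapes: dict) -> list:
    """Names of static-target scrapes that would actually be rendered.

    A scrape is rendered only when it is enabled and has a non-empty target;
    enabled scrapes with an empty target are dropped without an error.
    """
    return [
        name
        for name, cfg in scrapes.items()
        if cfg.get("enabled") and str(cfg.get("target") or "").strip()
    ]
```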
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| No traces | SDK disabled | Check OTEL_SDK_DISABLED is not true |
| No traces | Wrong endpoint | Verify endpoint is reachable |
| Disconnected traces | Context not propagated | Ensure trace headers pass between services |
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| No trace IDs in logs | Wrong logger | Use @kindo/observability logger |
Quick Reference
In an SMK install where peripheries.otel-collector: true, the gateway runs in the kindo-monitoring namespace and kindo-cli injects the OTel SDK env vars on every Kindo app pod automatically, so you don’t need to set these by hand. The exact set injected (see _otel_env in tools/kindo-cli/src/kindo_cli/config/secret_data.py):

    OTEL_SDK_DISABLED=false
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc
    OTEL_METRICS_EXPORTER=otlp
    OTEL_TRACES_EXPORTER=otlp
    OTEL_LOGS_EXPORTER=otlp
    OTEL_METRIC_EXPORT_INTERVAL=60000
    OTEL_BSP_MAX_QUEUE_SIZE=4096
    OTEL_BSP_SCHEDULE_DELAY=1000
    OTEL_SERVICE_NAME=<service-name>
    OTEL_RESOURCE_ATTRIBUTES=deployment.environment=<env>,service.namespace=kindo

When peripheries.otel-collector: false, the kill-switch + exporter-disable set ships instead:

    OTEL_SDK_DISABLED=true
    OTEL_EXPORTER_OTLP_ENDPOINT=
    OTEL_METRICS_EXPORTER=none
    OTEL_TRACES_EXPORTER=none
    OTEL_LOGS_EXPORTER=none

If you have custom apps that need to ship to the same gateway, set OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-gateway.kindo-monitoring:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=grpc on them too.
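The two env var sets above amount to a single switch on whether the collector is installed. The sketch below restates that switch as a function; it is not the real _otel_env from kindo-cli, and the function signature is an assumption. The values are copied from the lists in this section.

```python
GATEWAY_ENDPOINT = "http://otel-collector-gateway.kindo-monitoring:4317"


def otel_env(service_name: str, environment: str, collector_enabled: bool) -> dict:
    """Build the OTel env vars for one app pod (illustrative, not the kindo-cli code)."""
    if not collector_enabled:
        # Kill switch plus exporter-disable set.
        return {
            "OTEL_SDK_DISABLED": "true",
            "OTEL_EXPORTER_OTLP_ENDPOINT": "",
            "OTEL_METRICS_EXPORTER": "none",
            "OTEL_TRACES_EXPORTER": "none",
            "OTEL_LOGS_EXPORTER": "none",
        }
    return {
        "OTEL_SDK_DISABLED": "false",
        "OTEL_EXPORTER_OTLP_ENDPOINT": GATEWAY_ENDPOINT,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_LOGS_EXPORTER": "otlp",
        "OTEL_METRIC_EXPORT_INTERVAL": "60000",
        "OTEL_BSP_MAX_QUEUE_SIZE": "4096",
        "OTEL_BSP_SCHEDULE_DELAY": "1000",
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": (
            f"deployment.environment={environment},service.namespace=kindo"
        ),
    }
```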