Observability Guide
Kindo uses OpenTelemetry (OTEL) for unified observability across all services, covering traces, metrics, and logs.
Architecture

    +-----------------------------------------------------------+
    |                      Kindo Services                       |
    +--------+--------+--------+--------+--------+--------------+
    |  API   |  Task  |  Ext.  |Credits | LiteLLM|   Next.js    |
    |        | Worker |  Sync  |        |(Python)|              |
    +---+----+---+----+---+----+---+----+---+----+---+----------+
        |        |        |        |        |        |
        +--------+--------+--------+--------+--------+
                              |
                       +------v------+
                       |    OTEL     |
                       |  Collector  |
                       +------+------+
                              |
                 +------------+------------+
                 |            |            |
            +----v----+  +----v-----+  +---v------+
            | Grafana |  | Jaeger/  |  |  Other   |
            |(Metrics)|  |  Tempo   |  | Backends |
            +---------+  | (Traces) |  +----------+
                         +----------+

Key Principles
- Auto-Instrumentation: Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation.
- Custom Instrumentation: Business-critical operations are manually instrumented with custom spans and metrics.
- Unified Collection: All telemetry flows through the OTEL Collector for processing and export.
Environment Variables
Core OTEL Configuration
| Variable | Description | Required |
|---|---|---|
| OTEL_SERVICE_NAME | Unique identifier for the service | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol (http/protobuf or grpc) | No |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | No |
| OTEL_METRICS_EXPORTER | Metrics exporter (otlp, prometheus, none) | No |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval (ms) | No |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay (ms) | No |
| OTEL_SDK_DISABLED | Disable OTEL SDK entirely | No |
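
As a rough illustration of how a service might consume the variables in the table above, the sketch below reads them from an env-like object and fails fast when a required one is missing. The names `loadOtelConfig`, `parseResourceAttributes`, and `OtelConfig` are hypothetical and not part of any Kindo package; in practice the OTEL SDK reads these variables itself.

```typescript
// Hypothetical helper for illustration only; the OTEL SDK normally reads
// these variables on its own.
type Env = Record<string, string | undefined>;

interface OtelConfig {
  serviceName: string;
  endpoint: string;
  protocol: string | undefined;
  resourceAttributes: Record<string, string>;
  sdkDisabled: boolean;
}

// OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs.
function parseResourceAttributes(raw: string | undefined): Record<string, string> {
  const attrs: Record<string, string> = {};
  if (!raw) return attrs;
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed entries instead of throwing
    attrs[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return attrs;
}

function loadOtelConfig(env: Env): OtelConfig {
  // The two variables marked "Yes" in the Required column.
  const missing = ["OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT"].filter(
    (name) => !env[name],
  );
  if (missing.length > 0) {
    throw new Error(`Missing required OTEL variables: ${missing.join(", ")}`);
  }
  return {
    serviceName: env["OTEL_SERVICE_NAME"] as string,
    endpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"] as string,
    protocol: env["OTEL_EXPORTER_OTLP_PROTOCOL"],
    resourceAttributes: parseResourceAttributes(env["OTEL_RESOURCE_ATTRIBUTES"]),
    sdkDisabled: (env["OTEL_SDK_DISABLED"] ?? "").toLowerCase() === "true",
  };
}
```

In a Node.js service this would typically be called once at startup as `loadOtelConfig(process.env)`, so a misconfigured deployment stops immediately instead of silently exporting nothing.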
Trace Sampling
| Variable | Description | Example |
|---|---|---|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (10% sampling) |
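
To make the traceidratio setting more concrete, here is a simplified sketch of ratio-based sampling. It is not the SDK's actual algorithm, and `shouldSample` is a hypothetical name. It shows the key property: the decision is a deterministic function of the trace ID, so the same trace ID always gets the same decision at a given ratio.

```typescript
// Simplified illustration of ratio-based sampling. The real SDK sampler uses
// its own algorithm; this only demonstrates the deterministic-by-trace-ID idea.
function shouldSample(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error("traceId must be 32 lowercase hex characters");
  }
  // Clamp the ratio into [0, 1].
  const clamped = Math.min(1, Math.max(0, ratio));
  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace when its bucket falls below ratio * 2^32.
  return bucket < clamped * 0x100000000;
}
```

With a ratio of 0.1, roughly one in ten uniformly distributed trace IDs falls below the threshold, which matches the "10% sampling" example in the table.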
Service-Specific Configuration
API Service

    OTEL_SERVICE_NAME=api
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
    OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Auto-instruments HTTP, Express, Prisma, Redis, and RabbitMQ. Includes custom Hatchet instrumentation and Winston log correlation.
Task Worker

    OTEL_SERVICE_NAME=task-worker-ts
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Custom Hatchet SDK instrumentation for distributed workflow tracing. Propagates W3C traceparent through workflow metadata.
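
The traceparent propagation mentioned above can be sketched as follows. This is a minimal illustration of the W3C Trace Context header format (version 00 only), not the actual Hatchet instrumentation. The function names and the `traceparent` metadata key are assumptions made for the example.

```typescript
// W3C Trace Context header layout: version-traceid-parentid-flags, lowercase hex.
// Only version "00" is handled in this sketch.
interface TraceParent {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceParent(ctx: TraceParent): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceParent(header: string | undefined): TraceParent | null {
  const match = header ? TRACEPARENT_RE.exec(header.trim()) : null;
  if (!match) return null;
  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";
  // All-zero trace or span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest flag bit marks the trace as sampled.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical: copies the context into workflow metadata under an assumed key,
// so the worker that picks up the workflow can continue the same trace.
function injectIntoMetadata(
  metadata: Record<string, string>,
  ctx: TraceParent,
): Record<string, string> {
  return { ...metadata, traceparent: formatTraceParent(ctx) };
}
```

On the receiving side, a worker would call `parseTraceParent(metadata["traceparent"])` and, when the result is not null, start its span as a child of that context.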
LiteLLM

    OTEL_SERVICE_NAME=litellm
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Python-based service that uses the OpenTelemetry Python instrumentation. Includes a custom metrics guardrail and dynamic metrics via Unleash feature flags.
Other Services
All other services (external-sync, credits, audit-log-exporter, external-poller, Next.js) follow the same pattern with their respective OTEL_SERVICE_NAME.
Custom Metrics
Application Metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| kindo_merge_download_total | Counter | Merge download requests | none |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Duplicate workflow transaction latency | none |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | none |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts state conversions | none |
Grafana Dashboard Metrics
| Metric | Description |
|---|---|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption |
| kindo_ingestion_bytes_total | Bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing |
Auto-Instrumentation Coverage
| Library/Framework | What’s Traced |
|---|---|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |
Hatchet Workflow Spans
The @kindo/instrumentation-hatchet package traces workflow creation, execution, task completion, event publishing, cron scheduling, and admin operations.
| Span Attribute | Description |
|---|---|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique run identifier |
| hatchet.step_name | Current step name |
| hatchet.task_type | durable or non_durable |
| hatchet.task_duration | Execution time (ms) |
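
For orientation, the attribute set in the table above could be assembled roughly like this. The types and the `toSpanAttributes` helper are hypothetical and do not reflect the real API of @kindo/instrumentation-hatchet; only the attribute keys come from the table.

```typescript
// Hypothetical types and helper for illustration; only the attribute keys
// are taken from the table above.
type HatchetTaskType = "durable" | "non_durable";

interface HatchetTaskInfo {
  workflowName: string;
  workflowRunId: string;
  stepName: string;
  taskType: HatchetTaskType;
  startedAtMs: number;
  finishedAtMs: number;
}

function toSpanAttributes(info: HatchetTaskInfo): Record<string, string | number> {
  return {
    "hatchet.workflow_name": info.workflowName,
    "hatchet.workflow_run_id": info.workflowRunId,
    "hatchet.step_name": info.stepName,
    "hatchet.task_type": info.taskType,
    // Duration is reported in milliseconds.
    "hatchet.task_duration": info.finishedAtMs - info.startedAtMs,
  };
}
```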
Logging
Log Format
All backend services use structured JSON logging with trace correlation:

    {
      "level": "info",
      "message": "Request processed",
      "trace_id": "abc123def456...",
      "span_id": "789xyz...",
      "userId": "user-1",
      "timestamp": "2024-01-15T10:30:00.000Z"
    }

Sensitive Headers
These headers are automatically redacted from logs: authorization, cookie, x-api-key, x-connection-config, x-kindo-metadata.
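
A minimal sketch of this kind of redaction is shown below. It is not the actual logger implementation: `redactHeaders` is a hypothetical name and the `[REDACTED]` placeholder is an assumption. Only the header list comes from the text above.

```typescript
// Header names from the list above. Matching is case-insensitive because
// HTTP header names are case-insensitive.
const SENSITIVE_HEADERS = new Set([
  "authorization",
  "cookie",
  "x-api-key",
  "x-connection-config",
  "x-kindo-metadata",
]);

// The "[REDACTED]" placeholder is an assumption, not necessarily what Kindo emits.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    safe[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  return safe;
}
```

The function returns a copy instead of mutating its input, so the original headers stay usable for the request itself.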
Best Practices
Service Naming
Use consistent, lowercase, hyphenated names: api, task-worker-ts, external-sync, credits, litellm, audit-log-exporter, external-poller.
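
One way to enforce this convention, for example in a startup check or CI lint, is a small pattern test. The helper name and the exact pattern are assumptions; the pattern simply encodes "lowercase alphanumeric segments joined by single hyphens".

```typescript
// Lowercase alphanumeric segments joined by single hyphens, e.g. "task-worker-ts".
const SERVICE_NAME_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Hypothetical helper for illustration.
function isValidServiceName(name: string): boolean {
  return SERVICE_NAME_RE.test(name);
}
```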
Resource Attributes
Include deployment context:

    OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"

Sampling
For high-traffic services:

    OTEL_TRACES_SAMPLER=traceidratio
    OTEL_TRACES_SAMPLER_ARG=0.1

Metric Label Cardinality
Avoid user IDs, request IDs, or session IDs as metric labels. Use span attributes for high-cardinality data. Stick to fixed, enumerable values.
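
The reason is that every distinct combination of label values becomes its own time series in the metrics backend. The sketch below counts the series that a set of recorded label sets would produce. `countSeries` is a hypothetical helper and the label values are illustrative.

```typescript
// Each distinct combination of label values becomes its own time series.
function countSeries(samples: Array<Record<string, string>>): number {
  const seen = new Set<string>();
  for (const labels of samples) {
    // Sort keys so {a, b} and {b, a} map to the same series.
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
    seen.add(key);
  }
  return seen.size;
}

// Illustrative values: a fixed, enumerable label stays bounded.
const boundedLabels = [
  { result: "success" },
  { result: "failure" },
  { result: "success" },
];

// Adding a user ID creates one series per user.
const unboundedLabels = [
  { result: "success", userId: "user-1" },
  { result: "success", userId: "user-2" },
  { result: "success", userId: "user-3" },
];
```

Here `countSeries(boundedLabels)` is 2 no matter how many requests arrive, while `countSeries(unboundedLabels)` is 3 and keeps growing with every new user.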
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| No traces | SDK disabled | Check OTEL_SDK_DISABLED is not true |
| No traces | Wrong endpoint | Verify endpoint is reachable |
| Disconnected traces | Context not propagated | Ensure trace headers pass between services |
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| No trace IDs in logs | Wrong logger | Use @kindo/observability logger |
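
The environment-related rows of this table can be checked mechanically. The sketch below is a hypothetical diagnostic (`diagnoseOtelEnv` is not an existing tool) that inspects an env-like object and returns hints matching those rows. It cannot detect network or propagation problems.

```typescript
type EnvVars = Record<string, string | undefined>;

// Hypothetical diagnostic covering only the rows that depend on environment
// variables; reachability and context propagation need other checks.
function diagnoseOtelEnv(env: EnvVars): string[] {
  const hints: string[] = [];
  if ((env["OTEL_SDK_DISABLED"] ?? "").toLowerCase() === "true") {
    hints.push("OTEL_SDK_DISABLED is true: no traces will be produced");
  }
  if (!env["OTEL_EXPORTER_OTLP_ENDPOINT"]) {
    hints.push("OTEL_EXPORTER_OTLP_ENDPOINT is not set");
  }
  if (env["OTEL_METRICS_EXPORTER"] === "none") {
    hints.push("OTEL_METRICS_EXPORTER is none: set it to otlp to see metrics");
  }
  return hints;
}
```

An empty result means none of these three misconfigurations is present, not that telemetry is guaranteed to arrive.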
Quick Reference
Minimum required:

    OTEL_SERVICE_NAME=your-service-name
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Recommended production:

    OTEL_SERVICE_NAME=your-service-name
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
    OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.0.0
    OTEL_TRACES_SAMPLER=traceidratio
    OTEL_TRACES_SAMPLER_ARG=0.5
    OTEL_BSP_MAX_QUEUE_SIZE=2048
    OTEL_BSP_SCHEDULE_DELAY=5000