Observability Guide

Kindo uses OpenTelemetry (OTEL) for unified observability across all services, covering traces, metrics, and logs.

Architecture

+-----------------------------------------------------------+
|                      Kindo Services                       |
+--------+--------+--------+--------+--------+--------------+
|  API   |  Task  |  Ext.  |Credits | LiteLLM|   Next.js    |
|        | Worker |  Sync  |        |(Python)|              |
+---+----+---+----+---+----+---+----+---+----+---+----------+
    |        |        |        |        |        |
    +--------+--------+--------+--------+--------+
                      |
               +------v------+
               |    OTEL     |
               |  Collector  |
               +------+------+
                      |
         +------------+------------+
         |            |            |
    +----v----+  +----v-----+  +---v------+
    | Grafana |  | Jaeger/  |  |  Other   |
    |(Metrics)|  |  Tempo   |  | Backends |
    +---------+  | (Traces) |  +----------+
                 +----------+

Key Principles

  1. Auto-Instrumentation — Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation.
  2. Custom Instrumentation — Business-critical operations are manually instrumented with custom spans and metrics.
  3. Unified Collection — All telemetry flows through the OTEL Collector for processing and export.

Environment Variables

Core OTEL Configuration

| Variable | Description | Required |
|---|---|---|
| OTEL_SERVICE_NAME | Unique identifier for the service | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol (http/protobuf or grpc) | No |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | No |
| OTEL_METRICS_EXPORTER | Metrics exporter (otlp, prometheus, none) | No |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval (ms) | No |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay (ms) | No |
| OTEL_SDK_DISABLED | Disable OTEL SDK entirely | No |
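
As a rough illustration of how these variables fit together, the sketch below validates the two required variables at startup and fills in the numeric defaults defined by the OpenTelemetry specification. The helper name `load_otel_config` is hypothetical and not part of any Kindo package; the SDK itself reads these variables without such a helper.

```python
import os

# Variables the table above marks as required.
REQUIRED = ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT")

# Integer-valued variables with the defaults from the OpenTelemetry
# specification; used here only when the variable is unset or empty.
INT_DEFAULTS = {
    "OTEL_METRIC_EXPORT_INTERVAL": 60000,  # milliseconds
    "OTEL_BSP_MAX_QUEUE_SIZE": 2048,       # spans
    "OTEL_BSP_SCHEDULE_DELAY": 5000,       # milliseconds
}


def load_otel_config(env=None):
    """Return a validated view of the core OTEL settings (hypothetical helper)."""
    env = os.environ if env is None else env

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    config = {name: env[name] for name in REQUIRED}
    for name, default in INT_DEFAULTS.items():
        raw = env.get(name)
        config[name] = default if raw in (None, "") else int(raw)

    # Only the case-insensitive string "true" disables the SDK.
    disabled = env.get("OTEL_SDK_DISABLED", "")
    config["OTEL_SDK_DISABLED"] = disabled.strip().lower() == "true"
    return config
```

Calling `load_otel_config()` with no argument reads from the process environment; passing a plain dict makes the check easy to unit test.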

Trace Sampling

| Variable | Description | Example |
|---|---|---|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (10% sampling) |
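
To make the effect of `traceidratio` concrete, here is a minimal sketch of a ratio-based sampling decision. It assumes the common approach of comparing the low 64 bits of the trace ID against a bound derived from the ratio; the exact algorithm is SDK-specific, so treat this as an illustration rather than what any particular Kindo service runs. The function name `should_sample` is hypothetical.

```python
# Mask selecting the low 64 bits of a 128-bit trace ID.
TRACE_ID_LOW_BITS = (1 << 64) - 1


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept (illustrative only)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # A ratio of 0.1 keeps trace IDs whose low 64 bits fall in the lowest 10%.
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & TRACE_ID_LOW_BITS) < bound
```

Because the decision depends only on the trace ID, every service that sees the same trace with the same ratio reaches the same verdict, which keeps sampled traces complete across service boundaries.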

Service-Specific Configuration

API Service

OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Auto-instruments HTTP, Express, Prisma, Redis, and RabbitMQ. Includes custom Hatchet instrumentation and Winston log correlation.

Task Worker

OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Uses custom Hatchet SDK instrumentation for distributed workflow tracing, and propagates the W3C traceparent header through workflow metadata.
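
The propagation step can be sketched as follows. The `traceparent` format is the one defined by the W3C Trace Context specification; the `inject` and `extract` helpers and the use of a `traceparent` key in a plain metadata dict are assumptions made for illustration, not the actual Hatchet instrumentation.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str):
    """Return the parsed fields, or None if the value is malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def inject(metadata: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Producer side: copy the metadata and attach the trace context."""
    return {**metadata, "traceparent": format_traceparent(trace_id, span_id, sampled)}


def extract(metadata: dict):
    """Worker side: recover the parent context, if one was propagated."""
    return parse_traceparent(metadata.get("traceparent", ""))
```

A worker that gets `None` back from `extract` would start a new root trace, which is how the disconnected traces described under Troubleshooting typically arise.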

LiteLLM

OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Python-based service with OpenTelemetry Python instrumentation. Includes a custom metrics guardrail and dynamic metrics driven by Unleash feature flags.

Other Services

All other services (external-sync, credits, audit-log-exporter, external-poller, Next.js) follow the same pattern with their respective OTEL_SERVICE_NAME.

Custom Metrics

Application Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| kindo_merge_download_total | Counter | Merge download requests | |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Duplicate workflow transaction latency | |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts state conversions | |

Grafana Dashboard Metrics

| Metric | Description |
|---|---|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption |
| kindo_ingestion_bytes_total | Bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing |

Auto-Instrumentation Coverage

| Library/Framework | What’s Traced |
|---|---|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |

Hatchet Workflow Spans

The @kindo/instrumentation-hatchet package traces workflow creation, execution, task completion, event publishing, cron scheduling, and admin operations.

| Span Attribute | Description |
|---|---|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique run identifier |
| hatchet.step_name | Current step name |
| hatchet.task_type | durable or non_durable |
| hatchet.task_duration | Execution time (ms) |
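
For orientation, the attribute set above could be assembled like this. The helper `hatchet_span_attributes` is purely hypothetical and only shows the shape of the data; it is not the API of @kindo/instrumentation-hatchet, which sets these attributes itself.

```python
def hatchet_span_attributes(
    workflow_name: str,
    workflow_run_id: str,
    step_name: str,
    durable: bool,
    duration_ms: float,
) -> dict:
    """Build the span attribute dict listed in the table (hypothetical helper)."""
    if duration_ms < 0:
        raise ValueError("duration_ms must not be negative")
    return {
        "hatchet.workflow_name": workflow_name,
        "hatchet.workflow_run_id": workflow_run_id,
        "hatchet.step_name": step_name,
        # The table allows exactly two values for the task type.
        "hatchet.task_type": "durable" if durable else "non_durable",
        "hatchet.task_duration": duration_ms,
    }
```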

Logging

Log Format

All backend services use structured JSON logging with trace correlation:

{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "userId": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
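
A log line of this shape can be produced with a small formatter. The sketch below uses Python's standard `logging` module and expects the trace context to be supplied through the `extra` argument; the class name `JsonTraceFormatter` is hypothetical, and the Node.js services obtain the same result through Winston log correlation rather than this code.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object with trace correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Supplied by the caller via extra={...}; None when no span is active.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace(
                "+00:00", "Z"
            ),
        }
        user_id = getattr(record, "userId", None)
        if user_id is not None:
            payload["userId"] = user_id
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonTraceFormatter())` and log with `logger.info("Request processed", extra={"trace_id": ..., "span_id": ..., "userId": "user-1"})`.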

Sensitive Headers

These headers are automatically redacted from logs: authorization, cookie, x-api-key, x-connection-config, x-kindo-metadata.
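
A minimal sketch of this redaction, assuming headers arrive as a plain dict and that matching is case-insensitive. The `[REDACTED]` placeholder and the function name are assumptions; the actual logger may mask values differently.

```python
# Header names from the list above, stored lowercase for comparison.
SENSITIVE_HEADERS = frozenset(
    {"authorization", "cookie", "x-api-key", "x-connection-config", "x-kindo-metadata"}
)

REDACTED = "[REDACTED]"  # placeholder text is an assumption


def redact_headers(headers: dict) -> dict:
    """Return a copy of the headers with sensitive values masked."""
    return {
        name: REDACTED if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```

Returning a copy rather than mutating the input keeps the original request headers usable for the rest of the request lifecycle.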

Best Practices

Service Naming

Use consistent, lowercase, hyphenated names: api, task-worker-ts, external-sync, credits, litellm, audit-log-exporter, external-poller.

Resource Attributes

Include deployment context:

OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
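
The value is a comma-separated list of `key=value` pairs. When debugging a deployment it can help to see how such a string breaks down; the parser below is a simplified sketch (the SDK does its own parsing) and the function name is hypothetical.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Split an OTEL_RESOURCE_ATTRIBUTES string into a dict (simplified)."""
    attributes = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # skip empty or malformed entries
        # Values may be percent-encoded, so decode them.
        attributes[key] = unquote(value.strip())
    return attributes
```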

Sampling

For high-traffic services:

OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

Metric Label Cardinality

Avoid user IDs, request IDs, or session IDs as metric labels. Use span attributes for high-cardinality data. Stick to fixed, enumerable values.
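
One way to enforce this rule is an allow-list checked before a metric is recorded. In the sketch below the label keys come from the kindo_external_sync_message_processed_total metric, but the permitted values and the helper name are hypothetical examples.

```python
# Label keys mapped to their fixed, enumerable values (values are hypothetical).
ALLOWED_LABEL_VALUES = {
    "result": {"success", "failure"},
    "event": {"created", "updated", "deleted"},
}


def validate_labels(labels: dict, allowed: dict = ALLOWED_LABEL_VALUES) -> dict:
    """Reject labels that are not on the allow-list or not enumerated."""
    for key, value in labels.items():
        if key not in allowed:
            raise ValueError(f"label {key!r} is not on the allow-list")
        if value not in allowed[key]:
            raise ValueError(f"label {key}={value!r} is not an enumerated value")
    return dict(labels)
```

A key such as `user_id` fails the first check, which signals that the value belongs on a span attribute instead.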

Troubleshooting

| Symptom | Cause | Solution |
|---|---|---|
| No traces | SDK disabled | Check OTEL_SDK_DISABLED is not true |
| No traces | Wrong endpoint | Verify endpoint is reachable |
| Disconnected traces | Context not propagated | Ensure trace headers pass between services |
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| No trace IDs in logs | Wrong logger | Use @kindo/observability logger |
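
For the "Wrong endpoint" case, a quick TCP check from inside the service's network can rule out DNS and connectivity problems. This sketch only confirms that something accepts connections on the host and port; it does not verify that the listener is an OTLP receiver. The function name is hypothetical.

```python
import socket
from urllib.parse import urlparse


def endpoint_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    if parsed.hostname is None:
        return False
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `endpoint_reachable("http://otel-collector:4318")` run from a service container checks the endpoint used throughout this guide.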

Quick Reference

Minimum required:

OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Recommended production:

OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.0.0
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.5
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_SCHEDULE_DELAY=5000