# Observability Guide
Comprehensive reference for OpenTelemetry configuration and custom metrics in Kindo services.
## Table of Contents
- Overview
- Architecture
- Environment Variables Reference
- Service-Specific Configuration
- Custom Metrics Reference
- Instrumentation Details
- Logging Configuration
- Best Practices
- Troubleshooting
## Overview
Kindo uses OpenTelemetry (OTEL) for unified observability across all services. The telemetry stack consists of:
- Traces: Distributed tracing for request flow visualization
- Metrics: Custom application metrics and standard runtime metrics
- Logs: Structured JSON logging with trace correlation
### Key Principles
- Auto-Instrumentation: Node.js services use `@opentelemetry/auto-instrumentations-node` for automatic HTTP, database, and framework instrumentation
- Custom Instrumentation: Business-critical operations are manually instrumented with custom spans and metrics
- Unified Collection: All telemetry flows through the OTEL Collector for processing and export
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                         Kindo Services                          │
├──────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│   API    │   Task   │ External │ Credits  │ LiteLLM  │ Next.js  │
│          │  Worker  │   Sync   │          │ (Python) │          │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┘
     │          │          │          │          │          │
     └──────────┴──────────┴──────────┴──────────┴──────────┘
                                 │
                       ┌─────────▼─────────┐
                       │  OTEL Collector   │
                       │     (Gateway)     │
                       └─────────┬─────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │               │               │
            ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
            │ Grafana │    │  Jaeger/  │   │   Other   │
            │(Metrics)│    │   Tempo   │   │ Backends  │
            └─────────┘    │ (Traces)  │   └───────────┘
                           └───────────┘
```
## Environment Variables Reference
### Core OTEL Configuration
These environment variables are read directly by the OpenTelemetry SDK and apply to all Node.js services.
| Variable | Description | Example | Required |
|----------|-------------|---------|----------|
| OTEL_SERVICE_NAME | Unique identifier for the service in traces | api, task-worker-ts | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint URL | http://otel-collector:4318 | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol for OTLP export | http/protobuf, grpc | No (default: http/protobuf) |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | deployment.environment=production,service.namespace=kindo | No |
| OTEL_METRICS_EXPORTER | Metrics exporter type | otlp, prometheus, none | No (default: otlp) |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval in ms | 60000 | No (default: 60000) |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | 2048 | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay in ms | 5000 | No |
| OTEL_SDK_DISABLED | Completely disable OTEL SDK | true, false | No |
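
As a rough illustration of how the required flags and defaults in the table combine, the sketch below resolves a plain environment map into a typed config. `resolveOtelConfig`, `OtelConfig`, and `Env` are hypothetical names, not part of the OpenTelemetry SDK or any Kindo package; the SDK performs this resolution internally.

```typescript
// Hypothetical helper for illustration only; the OpenTelemetry SDK reads
// these variables itself and does not expose a function with this name.
type Env = Record<string, string | undefined>;

interface OtelConfig {
  serviceName: string;
  endpoint: string;
  protocol: string;
  metricsExporter: string;
  metricExportIntervalMs: number;
  sdkDisabled: boolean;
}

function resolveOtelConfig(env: Env): OtelConfig {
  const sdkDisabled = (env.OTEL_SDK_DISABLED ?? "").toLowerCase() === "true";
  const serviceName = env.OTEL_SERVICE_NAME ?? "";
  const endpoint = env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "";

  // The two "Required: Yes" rows only matter while the SDK is enabled.
  if (!sdkDisabled) {
    const missing: string[] = [];
    if (serviceName === "") missing.push("OTEL_SERVICE_NAME");
    if (endpoint === "") missing.push("OTEL_EXPORTER_OTLP_ENDPOINT");
    if (missing.length > 0) {
      throw new Error(`Missing required OTEL variables: ${missing.join(", ")}`);
    }
  }

  // Fall back to the documented default when the interval is unset or invalid.
  const interval = Number(env.OTEL_METRIC_EXPORT_INTERVAL ?? "60000");
  return {
    serviceName,
    endpoint,
    protocol: env.OTEL_EXPORTER_OTLP_PROTOCOL ?? "http/protobuf",
    metricsExporter: env.OTEL_METRICS_EXPORTER ?? "otlp",
    metricExportIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 60000,
    sdkDisabled,
  };
}
```

Passing only `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` yields the defaults from the table: `http/protobuf`, `otlp`, and a 60000 ms export interval.
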
### Trace Sampling Configuration
| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, always_off, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (for 10% sampling) |
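
To show why `traceidratio` keeps or drops a whole trace consistently across services, here is a simplified sketch of a ratio decision derived from the trace ID alone. `shouldSample` is a hypothetical function, and the SDK's own sampler may compute its threshold differently; the point is only that the decision is a deterministic function of the trace ID.

```typescript
// Simplified illustration of ratio-based sampling. This is NOT the SDK's
// implementation; it only demonstrates a deterministic, ID-based decision.
const TWO_POW_32 = 4294967296;

function shouldSample(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error("trace ID must be 32 lowercase hex characters");
  }
  // Clamp the ratio into [0, 1] so out-of-range arguments behave predictably.
  const clamped = Math.min(1, Math.max(0, ratio));
  // Interpret the low 32 bits of the trace ID as an integer in [0, 2^32).
  const low32 = parseInt(traceId.slice(-8), 16);
  // Keep the trace when that integer falls below ratio * 2^32.
  return low32 < clamped * TWO_POW_32;
}
```

With the W3C example trace ID `0af7651916cd43dd8448eb211c80319c`, the low 32 bits are `0x1c80319c`, about 11% of the 32-bit range, so this sketch drops the trace at a ratio of 0.1 and keeps it at 0.5. Every service that sees the same trace ID reaches the same answer.
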
### Log Configuration
| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_LOGS_EXPORTER | Logs exporter type | otlp, console, none |
## Service-Specific Configuration
### API Service (backend/api)
The API service is the main backend REST/tRPC API handling user requests.
#### Environment Variables
```bash
# Core service config
PORT=3000
NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
#### Instrumentation Features
- Auto-instrumentation for HTTP, Express, Prisma, Redis, RabbitMQ
- Custom Hatchet instrumentation for workflow triggering
- Winston logging with trace ID correlation
### Task Worker (backend/task-worker-ts)
The Task Worker handles background jobs via Hatchet, including AI chat operations.
#### Environment Variables
```bash
# Core service config
NODE_ENV=production
WORKER_TYPE=standard # or 'large', 'all'
# OTEL Configuration
OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
#### Instrumentation Features
- Auto-instrumentation for HTTP, Prisma, Redis
- Custom Hatchet SDK instrumentation for distributed workflow tracing
- Traces workflow creation, execution, and task completion
- Propagates W3C traceparent through workflow metadata
### External Sync (backend/external-sync)
Handles integration synchronization and content processing.
#### Environment Variables
```bash
NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=external-sync
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
#### Custom Metrics Emitted
- `kindo.external_sync.message.processed`: Counter with labels `topic`, `event`, `result`
### Credits Service (backend/credits)
Handles credit balance and token usage tracking.
#### Environment Variables
```bash
NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=credits
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
#### Custom Metrics Emitted
- `kindo.api.credits_service_called`: Counter for API calls to the credits service
### LiteLLM Service (backend/litellm)
Python-based LLM proxy service handling model routing and usage tracking.
#### Environment Variables
```bash
# OTEL Configuration
OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# Feature flags for custom metrics
UNLEASH_URL=http://unleash:4242
UNLEASH_API_KEY=your-api-key
```
#### Instrumentation Features
- Uses Python OpenTelemetry instrumentation packages
- Custom metrics guardrail for response pattern matching
- Dynamic metrics configuration via Unleash feature flags
### Next.js Frontend (apps/next)
Server-side rendered React application.
#### Environment Variables
```bash
# OTEL Configuration (optional)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_SDK_DISABLED=true # Set to disable if not using OTEL collection
```
#### Instrumentation Features
- Uses `@vercel/otel` for Next.js-specific instrumentation
- Exports logs to OTEL collector when endpoint is configured
### Audit Log Exporter (backend/audit-log-exporter)
Exports audit logs for compliance and security monitoring.
#### Environment Variables
```bash
NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=audit-log-exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
### External Poller (backend/external-poller)
Polls external services for integration data.
#### Environment Variables
```bash
NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=external-poller
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```
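## Custom Metrics Reference
### Kindo Application Metrics
These custom metrics are emitted by Kindo services and can be queried in your metrics backend (Grafana, Prometheus, etc.). Names in the table are in Prometheus form: dots in the OpenTelemetry instrument name become underscores and counters gain a `_total` suffix, so `kindo.external_sync.message.processed` is queried as `kindo_external_sync_message_processed_total`.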
| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| kindo_merge_download_total | Counter | Count of Merge download requests | - |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Latency of duplicate workflow transactions | - |
| kindo_model_delete_transaction_latency_milliseconds | Histogram | Latency of delete model transactions | - |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | - |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts converted from incomplete to output-available | - |
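
The relationship between the dotted instrument names used in the service sections and the underscore names in this table can be sketched as a small conversion function. `toPrometheusName`, `InstrumentKind`, and `UNIT_WORDS` are hypothetical names, and the sketch covers only the common rules (invalid characters to underscores, unit suffix, `_total` for counters), not every case a real exporter handles.

```typescript
type InstrumentKind = "counter" | "histogram" | "gauge";

// Common OTel unit abbreviations and the words Prometheus-style names use.
const UNIT_WORDS: Record<string, string> = { ms: "milliseconds", s: "seconds", By: "bytes" };

// Hypothetical helper: approximates how an OTel instrument name is rendered
// as a Prometheus metric name.
function toPrometheusName(otelName: string, kind: InstrumentKind, unit?: string): string {
  // Characters outside [a-zA-Z0-9_:] are not valid in Prometheus metric names.
  let name = otelName.replace(/[^a-zA-Z0-9_:]/g, "_");
  // Append the spelled-out unit unless the name already ends with it.
  const unitWord = unit === undefined ? undefined : UNIT_WORDS[unit];
  if (unitWord !== undefined && !name.endsWith(`_${unitWord}`)) {
    name = `${name}_${unitWord}`;
  }
  // Monotonic counters conventionally carry a _total suffix.
  if (kind === "counter" && !name.endsWith("_total")) {
    name = `${name}_total`;
  }
  return name;
}
```

For example, `toPrometheusName("kindo.api.credits_service_called", "counter")` returns `kindo_api_credits_service_called_total`, matching the row above.
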
### Additional Grafana Metrics
These metrics are available in production Grafana dashboards:
| Metric | Description |
|--------|-------------|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption tracking |
| kindo_ingestion_bytes_total | Total bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion operation timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing (null key) |
| kindo_unprocessed_file_count_outdated_llama_indexer_ingestion_version | Files needing re-indexing |
| kindo_unprocessed_external_cache_count_null_plaintext_key | External cache items awaiting processing |
## Instrumentation Details
### Auto-Instrumentation
All Node.js services automatically instrument the following via @opentelemetry/auto-instrumentations-node:
| Library/Framework | What's Traced |
|-------------------|---------------|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |
### Hatchet Workflow Instrumentation
The custom `@kindo/instrumentation-hatchet` package provides distributed tracing for Hatchet workflows.
#### Traced Operations
- Workflow creation and execution
- Task creation (durable and non-durable)
- Step run execution
- Event publishing
- Cron job scheduling
- Admin operations
#### Span Attributes
| Attribute | Description |
|-----------|-------------|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique workflow run identifier |
| hatchet.step_name | Current step name |
| hatchet.step_run_id | Step run identifier |
| hatchet.task_type | durable or non_durable |
| hatchet.trigger_type | How workflow was triggered (e.g., cron, event) |
| hatchet.cron_schedule | Cron expression if cron-triggered |
| hatchet.task_duration | Task execution time in milliseconds |
| hatchet.retry_count | Number of retry attempts |
#### Context Propagation
The instrumentation automatically:
- Injects W3C traceparent into workflow/task metadata
- Extracts and links traces from incoming action metadata
- Preserves distributed trace context across service boundaries
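
The inject and extract steps above can be sketched with the W3C `traceparent` format, `00-<trace-id>-<span-id>-<flags>`. This is an illustration only: `injectTraceContext`, `extractTraceContext`, and the `traceparent` metadata key are assumptions, and the actual `@kindo/instrumentation-hatchet` package may store and read the context differently.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

type Metadata = Record<string, string | undefined>;

// Version 00 of the W3C traceparent header: version-traceid-spanid-flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical: returns a copy of the metadata with a traceparent entry added.
function injectTraceContext(ctx: TraceContext, metadata: Metadata): Metadata {
  const flags = ctx.sampled ? "01" : "00";
  return { ...metadata, traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}` };
}

// Hypothetical: reads the traceparent entry back, or returns undefined when
// it is absent or malformed so the caller can start a new trace instead.
function extractTraceContext(metadata: Metadata): TraceContext | undefined {
  const match = TRACEPARENT_RE.exec(metadata.traceparent ?? "");
  if (match === null) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A producer would call `injectTraceContext` before triggering a workflow, and the worker would call `extractTraceContext` on the incoming metadata to parent its spans under the original trace.
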
## Logging Configuration
### Log Format
All backend services use structured JSON logging with automatic trace correlation.
Production Format:
```json
{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "userId": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
```
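
As a minimal sketch of how a record in this shape could be assembled, the function below merges the active span's IDs into the log fields. `formatLogRecord` and `ActiveSpan` are hypothetical names for illustration; the real services get this behavior from their logger and the Winston instrumentation rather than from a function like this.

```typescript
interface ActiveSpan {
  traceId: string;
  spanId: string;
}

type LogLevel = "debug" | "info" | "warn" | "error";

// Hypothetical formatter: builds one JSON log line with trace correlation.
function formatLogRecord(
  level: LogLevel,
  message: string,
  span: ActiveSpan | undefined,
  fields: Record<string, unknown>,
  now: Date,
): string {
  const record: Record<string, unknown> = { level, message };
  // Correlation IDs are added only when a span is active.
  if (span !== undefined) {
    record.trace_id = span.traceId;
    record.span_id = span.spanId;
  }
  // Caller-supplied fields (e.g. userId) are merged next and can overwrite
  // earlier keys; the timestamp is applied last so it always wins.
  Object.assign(record, fields, { timestamp: now.toISOString() });
  return JSON.stringify(record);
}
```

When no span is active, `trace_id` and `span_id` are simply omitted, which is one way the "no trace IDs in logs" symptom can arise.
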
### Log Levels
| Level | Description | When to Use |
|-------|-------------|-------------|
| debug | Detailed debugging info | Development only (not shown in production) |
| info | Normal operations | Standard operational events |
| warn | Recoverable issues | Non-critical problems |
| error | Failures requiring attention | Errors that need investigation |
### Sensitive Headers Filtered
The following headers are automatically redacted from request logs:
- `authorization`
- `cookie`
- `x-api-key`
- `x-connection-config`
- `x-kindo-metadata`
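
A redaction step of this kind might look like the sketch below. `redactHeaders` and the `[REDACTED]` placeholder are assumptions for illustration; only the header names come from the list above, and the actual logger may mask values differently.

```typescript
// Header names taken from the list above; everything else here is illustrative.
const SENSITIVE_HEADERS = new Set<string>([
  "authorization",
  "cookie",
  "x-api-key",
  "x-connection-config",
  "x-kindo-metadata",
]);

// Hypothetical helper: returns a copy of the headers that is safe to log.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    // HTTP header names are case-insensitive, so compare in lowercase.
    safe[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : headers[name];
  }
  return safe;
}
```

The input object is left untouched, so the request itself still carries the original values; only the copy passed to the logger is masked.
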
## Best Practices
### 1. Service Naming
Use consistent, lowercase, hyphenated names:
| Service | OTEL_SERVICE_NAME |
|---------|---------------------|
| API | api |
| Task Worker | task-worker-ts |
| External Sync | external-sync |
| Credits | credits |
| LiteLLM | litellm |
| Audit Log Exporter | audit-log-exporter |
| External Poller | external-poller |
### 2. Resource Attributes
Include deployment context for better filtering and grouping:
```bash
OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
```
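
The value is a comma-separated list of `key=value` pairs. A minimal sketch of how such a string breaks down into attributes follows; `parseResourceAttributes` is a hypothetical helper (the SDK parses this variable itself), and it deliberately ignores percent-encoding of special characters.

```typescript
// Hypothetical parser for illustration; does not handle percent-encoded
// commas or equals signs inside keys or values.
function parseResourceAttributes(raw: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    // Skip empty entries and entries without a key before the "=".
    if (idx <= 0) continue;
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key !== "") attrs[key] = value;
  }
  return attrs;
}
```

Applied to the example above, this yields three attributes: `deployment.environment`, `service.namespace`, and `service.version`.
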
### 3. Sampling Configuration
For high-traffic services, configure sampling to manage data volume:
```bash
# Sample 10% of traces
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```
### 4. Batch Processing Tuning
Adjust batch settings based on your traffic patterns:
```bash
# Increase queue size for high-volume services
OTEL_BSP_MAX_QUEUE_SIZE=4096
# Reduce export frequency for lower overhead
OTEL_BSP_SCHEDULE_DELAY=10000
```
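
To make the trade-off concrete, the sketch below models a bounded span queue: spans arriving while the queue is full are dropped, and each scheduled export drains one batch. `BoundedSpanQueue` is a hypothetical class, not the SDK's batch span processor, and `maxBatchSize` stands in for the SDK's separate `OTEL_BSP_MAX_EXPORT_BATCH_SIZE` setting.

```typescript
// Toy model of a batch processor's queue; illustrative only.
class BoundedSpanQueue<T> {
  private items: T[] = [];
  private readonly maxQueueSize: number;
  private readonly maxBatchSize: number;
  public dropped = 0;

  constructor(maxQueueSize: number, maxBatchSize: number) {
    this.maxQueueSize = maxQueueSize;
    this.maxBatchSize = maxBatchSize;
  }

  // Returns false when the queue is full and the span is dropped.
  enqueue(span: T): boolean {
    if (this.items.length >= this.maxQueueSize) {
      this.dropped += 1;
      return false;
    }
    this.items.push(span);
    return true;
  }

  // Called once per schedule delay: removes and returns up to one batch.
  takeBatch(): T[] {
    return this.items.splice(0, this.maxBatchSize);
  }

  get size(): number {
    return this.items.length;
  }
}
```

A larger queue absorbs bursts at the cost of memory, while a longer schedule delay lets more spans accumulate between exports, so raising the delay without raising the queue size makes drops more likely under load.
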
### 5. Metric Label Cardinality
Avoid high-cardinality labels in metrics:
- Do not use user IDs, request IDs, or session IDs as metric labels
- Use span attributes for high-cardinality data instead
- Stick to fixed, enumerable values for labels (e.g., `status=success|error`)
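
The reason for this rule is multiplicative: the worst-case number of time series for one metric is the product of the distinct values of each label. A small sketch of that arithmetic follows; `estimateSeriesCount` is a hypothetical helper, and the label values used with it are made up for illustration.

```typescript
// Hypothetical helper: upper bound on time series for a single metric,
// computed as the product of distinct values per label.
function estimateSeriesCount(labelValues: Record<string, readonly string[]>): number {
  let total = 1;
  for (const name of Object.keys(labelValues)) {
    // Duplicates do not create new series, so count distinct values only.
    total *= new Set(labelValues[name]).size;
  }
  return total;
}
```

With five topics, four events, and two results the bound is 40 series. Adding a label that takes 10,000 distinct user IDs raises it to 400,000, which is why such identifiers belong on spans rather than metrics.
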
## Troubleshooting
### Traces Not Appearing
| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No traces at all | OTEL disabled | Check OTEL_SDK_DISABLED is not true |
| No traces at all | Wrong endpoint | Verify OTEL_EXPORTER_OTLP_ENDPOINT is reachable |
| No traces at all | Sampling disabled | Check OTEL_TRACES_SAMPLER is not always_off |
| Traces disconnected | Context not propagated | Ensure services pass trace headers in HTTP requests |
### Missing Span Context in Hatchet Workflows
| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| Workflow spans not linked | Instrumentation not loaded | Verify @kindo/instrumentation-hatchet is registered |
| Parent spans missing | Metadata not propagated | Check workflow options include trace context |
### Metrics Not Exporting
| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| Delayed metrics | Long export interval | Reduce OTEL_METRIC_EXPORT_INTERVAL |
| Partial metrics | Collector not configured | Verify collector accepts metrics on configured endpoint |
### High Memory Usage
| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| Gradual memory growth | Too many unique metrics | Review metric cardinality, reduce unique label values |
### Log Correlation Issues
| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No trace IDs in logs | Wrong logger | Ensure using @kindo/observability logger |
| Intermittent trace IDs | Context lost | Check async operations maintain OTEL context |
## Quick Reference
### Minimum Required Configuration
```bash
# Required for all Node.js services
OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
```
### Recommended Production Configuration
```bash
OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.0.0
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.5
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_SCHEDULE_DELAY=5000
```
### Development Configuration (OTEL Disabled)
```bash
OTEL_SDK_DISABLED=true
```
### Development Configuration (Console Export for Debugging)
```bash
OTEL_SERVICE_NAME=your-service-name
OTEL_TRACES_EXPORTER=console
OTEL_METRICS_EXPORTER=console
OTEL_LOGS_EXPORTER=console
```
### Production Checklist
- `OTEL_SERVICE_NAME` set uniquely for each service
- `OTEL_EXPORTER_OTLP_ENDPOINT` points to collector
- `OTEL_RESOURCE_ATTRIBUTES` includes environment and version
- Sampling configured appropriately for traffic volume
- OTEL Collector configured to receive and export telemetry
- Grafana/Jaeger dashboards configured for visualization
- Alerts configured for critical metrics
This document is maintained alongside the codebase. Update when adding new services, metrics, or changing observability configuration.