# Observability Guide

Comprehensive reference for OpenTelemetry configuration and custom metrics in Kindo services

Table of Contents

  1. Overview
  2. Architecture
  3. Environment Variables Reference
  4. Service-Specific Configuration
  5. Custom Metrics Reference
  6. Instrumentation Details
  7. Logging Configuration
  8. Best Practices
  9. Troubleshooting
  10. Quick Reference

Overview

Kindo uses OpenTelemetry (OTEL) for unified observability across all services. The telemetry stack consists of:

  • Traces: Distributed tracing for request flow visualization
  • Metrics: Custom application metrics and standard runtime metrics
  • Logs: Structured JSON logging with trace correlation

Key Principles

  1. Auto-Instrumentation: Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation
  2. Custom Instrumentation: Business-critical operations are manually instrumented with custom spans and metrics
  3. Unified Collection: All telemetry flows through the OTEL Collector for processing and export

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Kindo Services                            │
├──────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│   API    │  Task    │ External │ Credits  │  LiteLLM │  Next.js │
│          │  Worker  │   Sync   │          │ (Python) │          │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┘
     │          │          │          │          │          │
     └──────────┴──────────┴──────────┴──────────┴──────────┘
                              │
                    ┌─────────▼─────────┐
                    │   OTEL Collector  │
                    │    (Gateway)      │
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
         │ Grafana │    │  Jaeger/  │   │   Other   │
         │(Metrics)│    │   Tempo   │   │  Backends │
         └─────────┘    │ (Traces)  │   └───────────┘
                        └───────────┘

Environment Variables Reference

Core OTEL Configuration

These environment variables are read directly by the OpenTelemetry SDK and apply to all Node.js services.

| Variable | Description | Example | Required |
|----------|-------------|---------|----------|
| OTEL_SERVICE_NAME | Unique identifier for the service in traces | api, task-worker-ts | Yes |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint URL | http://otel-collector:4318 | Yes |
| OTEL_EXPORTER_OTLP_PROTOCOL | Protocol for OTLP export | http/protobuf, grpc | No (default: http/protobuf) |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | deployment.environment=production,service.namespace=kindo | No |
| OTEL_METRICS_EXPORTER | Metrics exporter type | otlp, prometheus, none | No (default: otlp) |
| OTEL_METRIC_EXPORT_INTERVAL | Metrics export interval in ms | 60000 | No (default: 60000) |
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | 2048 | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay in ms | 5000 | No |
| OTEL_SDK_DISABLED | Completely disable OTEL SDK | true, false | No |
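
A startup check can catch a missing required variable before a service silently runs without telemetry. The following is a minimal sketch, assuming the two variables marked "Yes" above are the only required ones; `checkOtelEnv` and `OtelEnvReport` are hypothetical names and are not part of the OpenTelemetry SDK or the Kindo codebase.

```typescript
// Hypothetical helper for illustration only.
type Env = Record<string, string | undefined>;

interface OtelEnvReport {
  missing: string[];  // required variables that are unset or empty
  warnings: string[]; // settings that are valid but possibly unintended
}

const REQUIRED_VARS = ["OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT"];

function checkOtelEnv(env: Env): OtelEnvReport {
  // A variable counts as missing when it is undefined or only whitespace.
  const missing = REQUIRED_VARS.filter((name) => !env[name]?.trim());
  const warnings: string[] = [];

  if (env["OTEL_SDK_DISABLED"]?.toLowerCase() === "true") {
    warnings.push("OTEL_SDK_DISABLED=true: no telemetry will be exported");
  }

  const interval = env["OTEL_METRIC_EXPORT_INTERVAL"];
  if (interval !== undefined && !/^\d+$/.test(interval)) {
    warnings.push(`OTEL_METRIC_EXPORT_INTERVAL is not a whole number of ms: ${interval}`);
  }

  return { missing, warnings };
}
```

In a Node.js service this could be called once at startup with `process.env`, logging the warnings and failing fast when `missing` is not empty.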

Trace Sampling Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, always_off, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (for 10% sampling) |
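
Ratio-based sampling decides from the trace ID itself, so every span in a trace gets the same decision without coordination between services. The sketch below illustrates the idea only; `shouldSample` is a hypothetical function, and the SDK's actual sampler may derive its number from the trace ID differently.

```typescript
// Simplified illustration of trace-ID ratio sampling (not the SDK's exact algorithm).
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters (32 bits) of the trace ID as a pseudo-random value.
  const value = parseInt(traceId.slice(-8), 16);
  // Keep the trace when the value falls in the lowest `ratio` fraction of the 32-bit range.
  return value < ratio * 0x100000000;
}
```

Because the decision depends only on the trace ID and the ratio, two services configured with the same OTEL_TRACES_SAMPLER_ARG keep or drop the same traces. The OpenTelemetry specification also defines parentbased_traceidratio, which follows the caller's sampling decision when one is present.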

Log Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_LOGS_EXPORTER | Logs exporter type | otlp, console, none |


Service-Specific Configuration

API Service (backend/api)

The API service is the main backend REST/tRPC API handling user requests.

Environment Variables

# Core service config
PORT=3000
NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Instrumentation Features

  • Auto-instrumentation for HTTP, Express, Prisma, Redis, RabbitMQ
  • Custom Hatchet instrumentation for workflow triggering
  • Winston logging with trace ID correlation

Task Worker (backend/task-worker-ts)

The Task Worker handles background jobs via Hatchet, including AI chat operations.

Environment Variables

# Core service config
NODE_ENV=production
WORKER_TYPE=standard  # or 'large', 'all'
# OTEL Configuration
OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Instrumentation Features

  • Auto-instrumentation for HTTP, Prisma, Redis
  • Custom Hatchet SDK instrumentation for distributed workflow tracing
  • Traces workflow creation, execution, and task completion
  • Propagates W3C traceparent through workflow metadata

External Sync (backend/external-sync)

Handles integration synchronization and content processing.

Environment Variables

NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=external-sync
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Emitted

  • kindo.external_sync.message.processed - Counter with labels: topic, event, result (listed as kindo_external_sync_message_processed_total in the Custom Metrics Reference)

Credits Service (backend/credits)

Handles credit balance and token usage tracking.

Environment Variables

NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=credits
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Emitted

  • kindo.api.credits_service_called - Counter for API calls to the credits service (listed as kindo_api_credits_service_called_total in the Custom Metrics Reference)

LiteLLM Service (backend/litellm)

Python-based LLM proxy service handling model routing and usage tracking.

Environment Variables

# OTEL Configuration
OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# Feature flags for custom metrics
UNLEASH_URL=http://unleash:4242
UNLEASH_API_KEY=your-api-key

Instrumentation Features

  • Uses Python OpenTelemetry instrumentation packages
  • Custom metrics guardrail for response pattern matching
  • Dynamic metrics configuration via Unleash feature flags

Next.js Frontend (apps/next)

Server-side rendered React application.

Environment Variables

# OTEL Configuration (optional)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_SDK_DISABLED=true  # Set to true to disable the SDK when no OTEL collector is in use

Instrumentation Features

  • Uses @vercel/otel for Next.js-specific instrumentation
  • Exports logs to OTEL collector when endpoint is configured

Audit Log Exporter (backend/audit-log-exporter)

Exports audit logs for compliance and security monitoring.

Environment Variables

NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=audit-log-exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

External Poller (backend/external-poller)

Polls external services for integration data.

Environment Variables

NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=external-poller
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Reference

Kindo Application Metrics

These custom metrics are emitted by Kindo services and can be queried in your metrics backend (Grafana, Prometheus, etc.).

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| kindo_merge_download_total | Counter | Count of Merge download requests | - |
| kindo_workflow_duplicate_transaction_latency_milliseconds | Histogram | Latency of duplicate workflow transactions | - |
| kindo_model_delete_transaction_latency_milliseconds | Histogram | Latency of delete model transactions | - |
| kindo_external_sync_message_processed_total | Counter | External sync messages processed | topic, event, result |
| kindo_api_credits_service_called_total | Counter | API calls to credits service | - |
| kindo_chat_tool_state_conversion_total | Counter | Tool parts converted from incomplete to output-available | - |
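
The service sections list dotted names such as kindo.external_sync.message.processed, while this table uses underscore names ending in _total. That matches the usual translation applied when OpenTelemetry metrics are exported in Prometheus format, and the sketch below assumes that is what happens here. `toPrometheusName` is a hypothetical helper covering only the common rules, and `kindo.example.latency` in the usage note is a made-up name.

```typescript
type InstrumentKind = "counter" | "histogram" | "gauge";

// Unit abbreviations expanded into Prometheus-style suffixes (small subset only).
const UNIT_SUFFIX: Record<string, string> = {
  ms: "milliseconds",
  s: "seconds",
  By: "bytes",
};

function toPrometheusName(otelName: string, kind: InstrumentKind, unit?: string): string {
  // Prometheus names allow [a-zA-Z0-9_:], so dots and other characters become underscores.
  let name = otelName.replace(/[^a-zA-Z0-9_:]/g, "_");

  // Append the unit as a suffix unless the name already ends with it.
  const unitSuffix = unit ? UNIT_SUFFIX[unit] : undefined;
  if (unitSuffix && !name.endsWith(`_${unitSuffix}`)) name += `_${unitSuffix}`;

  // Monotonic counters conventionally end in _total.
  if (kind === "counter" && !name.endsWith("_total")) name += "_total";
  return name;
}
```

Under these assumptions, `toPrometheusName("kindo.external_sync.message.processed", "counter")` yields `kindo_external_sync_message_processed_total`, and a hypothetical histogram `kindo.example.latency` with unit `ms` yields `kindo_example_latency_milliseconds`.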

Additional Grafana Metrics

These metrics are available in production Grafana dashboards:

| Metric | Description |
|--------|-------------|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption tracking |
| kindo_ingestion_bytes_total | Total bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion operation timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing (null key) |
| kindo_unprocessed_file_count_outdated_llama_indexer_ingestion_version | Files needing re-indexing |
| kindo_unprocessed_external_cache_count_null_plaintext_key | External cache items awaiting processing |


Instrumentation Details

Auto-Instrumentation

All Node.js services automatically instrument the following via @opentelemetry/auto-instrumentations-node:

| Library/Framework | What's Traced |
|-------------------|---------------|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |

Hatchet Workflow Instrumentation

The custom @kindo/instrumentation-hatchet package provides distributed tracing for Hatchet workflows.

Traced Operations

  • Workflow creation and execution
  • Task creation (durable and non-durable)
  • Step run execution
  • Event publishing
  • Cron job scheduling
  • Admin operations

Span Attributes

| Attribute | Description |
|-----------|-------------|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique workflow run identifier |
| hatchet.step_name | Current step name |
| hatchet.step_run_id | Step run identifier |
| hatchet.task_type | durable or non_durable |
| hatchet.trigger_type | How workflow was triggered (e.g., cron, event) |
| hatchet.cron_schedule | Cron expression if cron-triggered |
| hatchet.task_duration | Task execution time in milliseconds |
| hatchet.retry_count | Number of retry attempts |

Context Propagation

The instrumentation automatically:

  1. Injects W3C traceparent into workflow/task metadata
  2. Extracts and links traces from incoming action metadata
  3. Preserves distributed trace context across service boundaries
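
The inject and extract steps above rely on the W3C traceparent format, `00-<trace-id>-<span-id>-<flags>`. The sketch below shows how such a value can be built, stored in metadata, and read back. The `traceparent` metadata key and all function names are assumptions for illustration; the actual layout used by @kindo/instrumentation-hatchet is not described in this guide.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Only version 00 is accepted in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, traceId = "", spanId = "", flags = "00"] = match;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical metadata shape: a flat string map carried with the workflow run.
function injectIntoMetadata(
  metadata: Record<string, string>,
  ctx: TraceContext,
): Record<string, string> {
  return { ...metadata, traceparent: formatTraceparent(ctx) };
}

function extractFromMetadata(metadata: Record<string, string>): TraceContext | null {
  const header = metadata["traceparent"];
  return header ? parseTraceparent(header) : null;
}
```

A worker that receives metadata without a valid traceparent gets `null` from `extractFromMetadata` and would start a new trace, which is one way workflow spans end up disconnected from their parent.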

Logging Configuration

Log Format

All backend services use structured JSON logging with automatic trace correlation.

Production Format:

{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "userId": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
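
A record in this shape can be produced by merging the active span's IDs into each log entry. The following is a minimal sketch of that idea, not the @kindo/observability implementation; `buildLogRecord` and `ActiveSpan` are hypothetical names, and the caller is assumed to supply the span IDs.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface ActiveSpan {
  traceId: string;
  spanId: string;
}

function buildLogRecord(
  level: LogLevel,
  message: string,
  span: ActiveSpan | undefined,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const record = {
    level,
    message,
    // Trace fields are included only when a span is active.
    ...(span ? { trace_id: span.traceId, span_id: span.spanId } : {}),
    ...fields,
    timestamp: now.toISOString(),
  };
  return JSON.stringify(record);
}
```

Passing the span explicitly keeps the sketch free of SDK calls; a real logger would typically read the active span from the OpenTelemetry context instead.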

Log Levels

| Level | Description | When to Use |
|-------|-------------|-------------|
| debug | Detailed debugging info | Development only (not shown in production) |
| info | Normal operations | Standard operational events |
| warn | Recoverable issues | Non-critical problems |
| error | Failures requiring attention | Errors that need investigation |

Sensitive Headers Filtered

The following headers are automatically redacted from request logs:

  • authorization
  • cookie
  • x-api-key
  • x-connection-config
  • x-kindo-metadata
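
Redaction of this kind can be sketched as a lookup against the header list above. `redactHeaders` is a hypothetical function and the `[REDACTED]` placeholder is an assumption; the guide does not say what replacement value the real logger writes.

```typescript
const SENSITIVE_HEADERS = new Set([
  "authorization",
  "cookie",
  "x-api-key",
  "x-connection-config",
  "x-kindo-metadata",
]);

function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const redacted: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so compare in lowercase.
    redacted[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  return redacted;
}
```

The function returns a copy, so the original request headers stay intact for the handler while only the logged version is masked.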

Best Practices

1. Service Naming

Use consistent, lowercase, hyphenated names:

| Service | OTEL_SERVICE_NAME |
|---------|---------------------|
| API | api |
| Task Worker | task-worker-ts |
| External Sync | external-sync |
| Credits | credits |
| LiteLLM | litellm |
| Audit Log Exporter | audit-log-exporter |
| External Poller | external-poller |

2. Resource Attributes

Include deployment context for better filtering and grouping:

OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
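
The value is a comma-separated list of key=value pairs. A small parser makes it easy to verify in CI or at startup that the expected attributes are present. This is a sketch under simplifying assumptions: percent-decoding of values is not handled, and `parseResourceAttributes` and `missingRecommended` are hypothetical helpers. The recommended keys are taken from the Production Checklist in this guide (environment and version).

```typescript
function parseResourceAttributes(raw: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip empty entries and entries without a key
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key) attrs[key] = value;
  }
  return attrs;
}

const RECOMMENDED_KEYS = ["deployment.environment", "service.version"];

// Returns the recommended keys that are absent from the parsed attributes.
function missingRecommended(attrs: Record<string, string>): string[] {
  return RECOMMENDED_KEYS.filter((key) => !(key in attrs));
}
```

For the example value above, `missingRecommended` returns an empty list; for `deployment.environment=production` alone it reports `service.version`.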

3. Sampling Configuration

For high-traffic services, configure sampling to manage data volume:

# Sample 10% of traces
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

4. Batch Processing Tuning

Adjust batch settings based on your traffic patterns:

# Increase queue size for high-volume services
OTEL_BSP_MAX_QUEUE_SIZE=4096
# Reduce export frequency for lower overhead
OTEL_BSP_SCHEDULE_DELAY=10000
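
The trade-off behind these two settings can be seen in a simplified model of a bounded export queue: spans accumulate until the next scheduled export, and anything arriving while the queue is full is dropped. `BoundedSpanQueue` below is an illustrative model only, not the SDK's batch span processor.

```typescript
class BoundedSpanQueue<T> {
  private items: T[] = [];
  private readonly maxQueueSize: number;
  dropped = 0; // number of items rejected because the queue was full

  constructor(maxQueueSize: number) {
    this.maxQueueSize = maxQueueSize;
  }

  // Returns false (and counts a drop) when the queue is already full.
  enqueue(item: T): boolean {
    if (this.items.length >= this.maxQueueSize) {
      this.dropped++;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Called once per schedule delay: removes and returns up to maxBatch items.
  drain(maxBatch: number): T[] {
    return this.items.splice(0, maxBatch);
  }

  get size(): number {
    return this.items.length;
  }
}
```

In this model a larger queue or a shorter delay reduces drops during bursts, while a larger queue also raises the worst-case memory held between exports, which is why the High Memory Usage table suggests reducing OTEL_BSP_MAX_QUEUE_SIZE.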

5. Metric Label Cardinality

Avoid high-cardinality labels in metrics:

  • Do not use user IDs, request IDs, or session IDs as metric labels
  • Use span attributes for high-cardinality data instead
  • Stick to fixed, enumerable values for labels (e.g., status=success|error)
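
One way to apply these rules is to route attributes through an allow-list before recording a metric, so that only known low-cardinality keys become labels and everything else stays on the span. The sketch below assumes a hypothetical allow-list and helper names; the keys topic, event, and result come from the external sync counter documented in this guide, and status from the example above.

```typescript
// Hypothetical allow-list of low-cardinality label keys.
const METRIC_LABEL_ALLOWLIST = new Set(["topic", "event", "result", "status"]);

interface SplitAttributes {
  metricLabels: Record<string, string>;       // safe to attach to metrics
  spanOnlyAttributes: Record<string, string>; // high-cardinality, spans only
}

function splitAttributes(attrs: Record<string, string>): SplitAttributes {
  const metricLabels: Record<string, string> = {};
  const spanOnlyAttributes: Record<string, string> = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (METRIC_LABEL_ALLOWLIST.has(key)) {
      metricLabels[key] = value;
    } else {
      spanOnlyAttributes[key] = value;
    }
  }
  return { metricLabels, spanOnlyAttributes };
}

// Collapse an open-ended value (HTTP status code) into a fixed label set.
function statusLabel(statusCode: number): "success" | "error" {
  return statusCode >= 200 && statusCode < 400 ? "success" : "error";
}
```

With this split, an attribute such as a user ID never reaches the metrics pipeline even if a caller passes it by mistake.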

Troubleshooting

Traces Not Appearing

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No traces at all | OTEL disabled | Check that OTEL_SDK_DISABLED is not set to true |
| No traces at all | Wrong endpoint | Verify that OTEL_EXPORTER_OTLP_ENDPOINT is reachable from the service |
| No traces at all | Sampling disabled | Check that OTEL_TRACES_SAMPLER is not set to always_off |
| Traces disconnected | Context not propagated | Ensure services pass trace headers (W3C traceparent) in HTTP requests |

Missing Span Context in Hatchet Workflows

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| Workflow spans not linked | Instrumentation not loaded | Verify @kindo/instrumentation-hatchet is registered |
| Parent spans missing | Metadata not propagated | Check workflow options include trace context |

Metrics Not Exporting

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| Delayed metrics | Long export interval | Reduce OTEL_METRIC_EXPORT_INTERVAL |
| Partial metrics | Collector not configured | Verify that the collector accepts metrics on the configured endpoint |

High Memory Usage

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| Gradual memory growth | Too many unique metrics | Review metric cardinality, reduce unique label values |

Log Correlation Issues

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No trace IDs in logs | Wrong logger | Ensure the service uses the @kindo/observability logger |
| Intermittent trace IDs | Context lost | Check that async operations preserve the active OTEL context |


Quick Reference

Minimum Required Configuration

# Required for all Node.js services
OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Recommended Production Configuration

OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.0.0
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.5
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_SCHEDULE_DELAY=5000

Development Configuration (OTEL Disabled)

OTEL_SDK_DISABLED=true

Development Configuration (Console Export for Debugging)

OTEL_SERVICE_NAME=your-service-name
OTEL_TRACES_EXPORTER=console
OTEL_METRICS_EXPORTER=console
OTEL_LOGS_EXPORTER=console

Production Checklist

  • OTEL_SERVICE_NAME set uniquely for each service
  • OTEL_EXPORTER_OTLP_ENDPOINT points to collector
  • OTEL_RESOURCE_ATTRIBUTES includes environment and version
  • Sampling configured appropriately for traffic volume
  • OTEL Collector configured to receive and export telemetry
  • Grafana/Jaeger dashboards configured for visualization
  • Alerts configured for critical metrics

This document is maintained alongside the codebase. Update when adding new services, metrics, or changing observability configuration.