# Observability Guide

Comprehensive reference for OpenTelemetry configuration and custom metrics in Kindo services.

Overview

Kindo uses OpenTelemetry (OTEL) for unified observability across all services. The telemetry stack consists of:

Traces: Distributed tracing for request flow visualization
Metrics: Custom application metrics and standard runtime metrics
Logs: Structured JSON logging with trace correlation

Key Principles

Auto-Instrumentation: Node.js services use @opentelemetry/auto-instrumentations-node for automatic HTTP, database, and framework instrumentation
Custom Instrumentation: Business-critical operations are manually instrumented with custom spans and metrics
Unified Collection: All telemetry flows through the OTEL Collector for processing and export

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Kindo Services                            │
├──────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│   API    │  Task    │ External │ Credits  │  LiteLLM │  Next.js │
│          │  Worker  │   Sync   │          │ (Python) │          │
└────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┴────┬─────┘
     │          │          │          │          │          │
     └──────────┴──────────┴──────────┴──────────┴──────────┘
                              │
                    ┌─────────▼─────────┐
                    │   OTEL Collector  │
                    │    (Gateway)      │
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
         │ Grafana │    │  Jaeger/  │   │   Other   │
         │(Metrics)│    │   Tempo   │   │  Backends │
         └─────────┘    │ (Traces)  │   └───────────┘
                        └───────────┘

Environment Variables Reference

Core OTEL Configuration

These environment variables are read directly by the OpenTelemetry SDK and apply to all Node.js services.

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| OTEL_BSP_MAX_QUEUE_SIZE | Max spans queued before dropping | 2048 | No |
| OTEL_BSP_SCHEDULE_DELAY | Batch export delay in ms | 5000 | No |

Trace Sampling Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_TRACES_SAMPLER | Sampler type | always_on, always_off, traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.1 (for 10% sampling) |

Log Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_LOGS_EXPORTER | Logs exporter type | otlp, console, none |

Service-Specific Configuration

API Service (`backend/api`)

The API service is the main backend REST/tRPC API handling user requests.

Environment Variables

# Core service config
PORT=3000
NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Instrumentation Features

Auto-instrumentation for HTTP, Express, Prisma, Redis, RabbitMQ
Custom Hatchet instrumentation for workflow triggering
Winston logging with trace ID correlation

Task Worker (`backend/task-worker-ts`)

The Task Worker handles background jobs via Hatchet, including AI chat operations.

Environment Variables

# Core service config
NODE_ENV=production
# Worker type: 'standard', 'large', or 'all'
WORKER_TYPE=standard
# OTEL Configuration
OTEL_SERVICE_NAME=task-worker-ts
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Instrumentation Features

Auto-instrumentation for HTTP, Prisma, Redis
Custom Hatchet SDK instrumentation for distributed workflow tracing
Traces workflow creation, execution, and task completion
Propagates W3C traceparent through workflow metadata

External Sync (`backend/external-sync`)

Handles integration synchronization and content processing.

Environment Variables

NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=external-sync
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Emitted

kindo.external_sync.message.processed - Counter with labels: topic, event, result
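
To make the shape of this counter concrete, the following is a minimal, stdlib-only sketch of a counter keyed by the three labels. The `LabeledCounter` class and the label values (`files`, `updated`, `success`, `error`) are illustrative assumptions, not the service's actual implementation, which would normally go through an OpenTelemetry meter.

```python
from collections import Counter


class LabeledCounter:
    """Toy counter keyed by a fixed set of label names (illustrative only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def _key(self, labels):
        # Reject unknown or missing labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        return tuple(labels[name] for name in self.label_names)

    def add(self, amount=1, **labels):
        self._values[self._key(labels)] += amount

    def value(self, **labels):
        return self._values[self._key(labels)]


processed = LabeledCounter(
    "kindo.external_sync.message.processed", ["topic", "event", "result"]
)
# Hypothetical label values, chosen only for the example.
processed.add(topic="files", event="updated", result="success")
processed.add(topic="files", event="updated", result="success")
processed.add(topic="files", event="updated", result="error")
```

Each distinct combination of `topic`, `event`, and `result` becomes its own series, which is why the label values should stay few and enumerable.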

Credits Service (`backend/credits`)

Handles credit balance and token usage tracking.

Environment Variables

NODE_ENV=production
PORT=3000
# OTEL Configuration
OTEL_SERVICE_NAME=credits
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Emitted

kindo.api.credits_service_called - Counter for API calls to the credits service

LiteLLM Service (`backend/litellm`)

Python-based LLM proxy service handling model routing and usage tracking.

Environment Variables

# OTEL Configuration
OTEL_SERVICE_NAME=litellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# Feature flags for custom metrics
UNLEASH_URL=http://unleash:4242
UNLEASH_API_KEY=your-api-key

Instrumentation Features

Uses Python OpenTelemetry instrumentation packages
Custom metrics guardrail for response pattern matching
Dynamic metrics configuration via Unleash feature flags

Next.js Frontend (`apps/next`)

Server-side rendered React application.

Environment Variables

# OTEL Configuration (optional)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
# Set to true only when OTEL collection is not in use; this disables the SDK
OTEL_SDK_DISABLED=true

Instrumentation Features

Uses @vercel/otel for Next.js-specific instrumentation
Exports logs to OTEL collector when endpoint is configured

Audit Log Exporter (`backend/audit-log-exporter`)

Exports audit logs for compliance and security monitoring.

Environment Variables

NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=audit-log-exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

External Poller (`backend/external-poller`)

Polls external services for integration data.

Environment Variables

NODE_ENV=production
# OTEL Configuration
OTEL_SERVICE_NAME=external-poller
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Custom Metrics Reference

Kindo Application Metrics

These custom metrics are emitted by Kindo services and can be queried in your metrics backend (Grafana, Prometheus, etc.).

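| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| kindo.external_sync.message.processed | Counter | Messages processed by the External Sync service | topic, event, result |
| kindo.api.credits_service_called | Counter | API calls to the credits service | |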

Additional Grafana Metrics

These metrics are available in production Grafana dashboards:

| Metric | Description |
|--------|-------------|
| kindo_chat_message_total | Total chat messages processed |
| kindo_token_usage_total | Token consumption tracking |
| kindo_ingestion_bytes_total | Total bytes ingested |
| kindo_ingestion_duration_seconds_bucket | Ingestion operation timing (histogram) |
| kindo_unprocessed_file_count_null_plaintext_key | Files awaiting processing (null key) |
| kindo_unprocessed_file_count_outdated_llama_indexer_ingestion_version | Files needing re-indexing |
| kindo_unprocessed_external_cache_count_null_plaintext_key | External cache items awaiting processing |
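
Histogram series such as `kindo_ingestion_duration_seconds_bucket` store cumulative counts per upper bound rather than raw timings, so percentiles have to be estimated. The sketch below shows one way to do that with linear interpolation inside the matching bucket, similar in spirit to what Prometheus-style backends do. The function name and the bucket values are made up for illustration and are not taken from Kindo dashboards.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Overflow or empty bucket: fall back to the last known bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts: 10 runs <= 0.5s, 60 <= 1s, 100 <= 5s.
example_buckets = [(0.5, 10), (1.0, 60), (5.0, 100), (math.inf, 100)]
median = estimate_quantile(0.5, example_buckets)
p95 = estimate_quantile(0.95, example_buckets)
```

The estimate is only as precise as the bucket boundaries, so a p95 that lands in a wide bucket should be read as approximate.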

Instrumentation Details

Auto-Instrumentation

All Node.js services automatically instrument the following via @opentelemetry/auto-instrumentations-node:

| Library/Framework | What's Traced |
|-------------------|---------------|
| HTTP/HTTPS | Incoming and outgoing requests |
| Express | Route handlers and middleware |
| Prisma | Database queries |
| Redis | Cache operations |
| RabbitMQ | Message queue operations |
| gRPC | Service-to-service calls |
| Winston | Log correlation with trace IDs |

Hatchet Workflow Instrumentation

The custom @kindo/instrumentation-hatchet package provides distributed tracing for Hatchet workflows.

Traced Operations

Workflow creation and execution
Task creation (durable and non-durable)
Step run execution
Event publishing
Cron job scheduling
Admin operations

Span Attributes

| Attribute | Description |
|-----------|-------------|
| hatchet.workflow_name | Name of the workflow |
| hatchet.workflow_run_id | Unique workflow run identifier |
| hatchet.step_name | Current step name |
| hatchet.step_run_id | Step run identifier |
| hatchet.task_type | durable or non_durable |
| hatchet.trigger_type | How workflow was triggered (e.g., cron, event) |
| hatchet.cron_schedule | Cron expression if cron-triggered |
| hatchet.task_duration | Task execution time in milliseconds |
| hatchet.retry_count | Number of retry attempts |

Context Propagation

The instrumentation automatically:

Injects W3C traceparent into workflow/task metadata
Extracts and links traces from incoming action metadata
Preserves distributed trace context across service boundaries
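
The inject and extract steps above can be sketched with the W3C Trace Context header format (`version-traceid-parentid-flags`). This is a stdlib-only illustration: the function names and the use of a plain dict as the metadata carrier are assumptions, and the real `@kindo/instrumentation-hatchet` package may structure this differently.

```python
import re

# W3C Trace Context, version 00: lowercase hex fields separated by dashes.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject_traceparent(metadata, trace_id, span_id, sampled=True):
    """Return a copy of `metadata` with a `traceparent` entry added."""
    flags = "01" if sampled else "00"
    carrier = dict(metadata)
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return carrier


def extract_traceparent(metadata):
    """Return (trace_id, span_id, sampled), or None if absent or malformed."""
    match = _TRACEPARENT_RE.fullmatch(str(metadata.get("traceparent", "")))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Producer side: attach the current context to workflow metadata.
outgoing = inject_traceparent(
    {"workflow": "example"},  # hypothetical metadata
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
# Consumer side: recover the parent context before starting the task span.
parent = extract_traceparent(outgoing)
```

Returning `None` for malformed input lets the consumer start a fresh trace instead of failing the task.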

Logging Configuration

Log Format

All backend services use structured JSON logging with automatic trace correlation.

Production Format:

{
  "level": "info",
  "message": "Request processed",
  "trace_id": "abc123def456...",
  "span_id": "789xyz...",
  "userId": "user-1",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
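
As a rough illustration of how such a record can be produced, here is a stdlib-only sketch of a JSON formatter that copies trace fields onto each log line. The `JsonTraceFormatter` class and the convention of passing `trace_id` and `span_id` through `extra=` are assumptions made for the example; Kindo's backend services use Winston with automatic correlation instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object with trace correlation fields."""

    # The level table in this guide uses "warn"; Python's logging says "WARNING".
    LEVEL_ALIASES = {"WARNING": "warn"}

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": self.LEVEL_ALIASES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            # Hypothetical convention: callers attach the IDs via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        }
        return json.dumps(payload)


logger = logging.getLogger("observability-example")
handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Request processed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
```

Records logged without trace fields still serialize, with `trace_id` and `span_id` set to `null`, which makes missing correlation easy to spot in a log search.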

Log Levels

| Level | Description | When to Use |
|-------|-------------|-------------|
| debug | Detailed debugging info | Development only (not shown in production) |
| info | Normal operations | Standard operational events |
| warn | Recoverable issues | Non-critical problems |
| error | Failures requiring attention | Errors that need investigation |

Sensitive Headers Filtered

The following headers are automatically redacted from request logs:

authorization
cookie
x-api-key
x-connection-config
x-kindo-metadata
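
A redaction step for the headers listed above might look like the following sketch. The function name and the `[REDACTED]` placeholder are assumptions; the header names come from the list above, and matching is case-insensitive because HTTP header names are.

```python
SENSITIVE_HEADERS = frozenset({
    "authorization",
    "cookie",
    "x-api-key",
    "x-connection-config",
    "x-kindo-metadata",
})


def redact_headers(headers, placeholder="[REDACTED]"):
    """Return a copy of `headers` with sensitive values replaced."""
    return {
        name: placeholder if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


safe = redact_headers({
    "Authorization": "Bearer example-token",  # made-up value
    "X-Api-Key": "example-key",               # made-up value
    "Content-Type": "application/json",
})
```

The input mapping is left untouched, so the original headers remain available to the request handler while only the logged copy is scrubbed.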

Best Practices

1. Service Naming

Use consistent, lowercase, hyphenated names:

| Service | OTEL_SERVICE_NAME |
|---------|---------------------|
| API | api |
| Task Worker | task-worker-ts |
| External Sync | external-sync |
| Credits | credits |
| LiteLLM | litellm |
| Audit Log Exporter | audit-log-exporter |
| External Poller | external-poller |
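
A naming convention like this is easy to check in CI or at startup. The sketch below is one possible check; the exact pattern (lowercase letters and digits in hyphen-separated groups) is an assumption derived from the names in the table, not a rule enforced by OpenTelemetry.

```python
import re

# Assumed convention: lowercase alphanumeric groups joined by single hyphens.
_SERVICE_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_service_name(name):
    """Return True if `name` follows the lowercase, hyphenated convention."""
    return _SERVICE_NAME_RE.fullmatch(name) is not None


known_names = [
    "api",
    "task-worker-ts",
    "external-sync",
    "credits",
    "litellm",
    "audit-log-exporter",
    "external-poller",
]
all_valid = all(is_valid_service_name(name) for name in known_names)
```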

2. Resource Attributes

Include deployment context for better filtering and grouping:

OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
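
The value is a comma-separated list of `key=value` pairs. To show how that string maps to individual attributes, here is a small parsing sketch. The function name is hypothetical, and skipping malformed pairs is a simplification chosen for the example; real SDKs parse this variable themselves and may handle errors differently.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attributes = {}
    for pair in raw.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep or not key.strip():
            # Simplification: skip malformed entries instead of failing.
            continue
        # Values may be percent-encoded to carry commas or equals signs.
        attributes[key.strip()] = unquote(value.strip())
    return attributes


attrs = parse_resource_attributes(
    "deployment.environment=production,service.namespace=kindo,service.version=1.2.3"
)
```

Filtering in a tracing or metrics backend then works on these keys, for example by `deployment.environment` or `service.version`.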

3. Sampling Configuration

For high-traffic services, configure sampling to manage data volume:

# Sample 10% of traces
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
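
Ratio sampling is deterministic per trace: the decision is derived from the trace ID, so every service that sees the same ID makes the same choice. The sketch below shows one common approach, comparing the low 64 bits of the trace ID against a bound derived from the ratio. It is an illustration under that assumption; the exact algorithm can differ between SDKs.

```python
TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the 128-bit trace ID


def should_sample(trace_id_hex, ratio):
    """Decide whether a trace is kept for a given sampling ratio (0.0 to 1.0)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & TRACE_ID_MASK) < bound


# The same trace ID always yields the same decision for a given ratio.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
kept_at_10_percent = should_sample(trace_id, 0.1)
kept_at_70_percent = should_sample(trace_id, 0.7)
```

Because the decision depends only on the ID and the ratio, services should share the same ratio (or rely on the parent's decision) to avoid partially sampled traces.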

4. Batch Processing Tuning

Adjust batch settings based on your traffic patterns:

# Increase queue size for high-volume services
OTEL_BSP_MAX_QUEUE_SIZE=4096
# Reduce export frequency for lower overhead
OTEL_BSP_SCHEDULE_DELAY=10000
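
To see why these two settings interact, consider a simplified model of the batch span processor: finished spans enter a bounded queue, and a scheduled export drains it in batches. The class below is a toy model, not the SDK's implementation. The batch size parameter mirrors the separate `OTEL_BSP_MAX_EXPORT_BATCH_SIZE` setting, which this guide does not cover, and real processors also export early once a full batch is available.

```python
from collections import deque


class BoundedSpanQueue:
    """Toy model of a batch span processor's queue (illustrative only)."""

    def __init__(self, max_queue_size=2048, max_export_batch_size=512):
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self._queue = deque()
        self.dropped = 0

    def on_end(self, span):
        # A full queue drops new spans instead of blocking the application.
        if len(self._queue) >= self.max_queue_size:
            self.dropped += 1
            return False
        self._queue.append(span)
        return True

    def drain_batch(self):
        """Called on each export tick; returns at most one batch of spans."""
        batch = []
        while self._queue and len(batch) < self.max_export_batch_size:
            batch.append(self._queue.popleft())
        return batch


# A burst of 6 spans between export ticks, with room for only 4.
queue = BoundedSpanQueue(max_queue_size=4, max_export_batch_size=3)
accepted = [queue.on_end(f"span-{i}") for i in range(6)]
first_batch = queue.drain_batch()
```

A longer schedule delay gives bursts more time to fill the queue, so raising `OTEL_BSP_SCHEDULE_DELAY` usually calls for a larger `OTEL_BSP_MAX_QUEUE_SIZE` as well, at the cost of more memory.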

5. Metric Label Cardinality

Avoid high-cardinality labels in metrics:

Do not use user IDs, request IDs, or session IDs as metric labels
Use span attributes for high-cardinality data instead
Stick to fixed, enumerable values for labels (e.g., status=success|error)
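
The cost of a label is multiplicative: the worst-case number of time series for a metric is the product of the distinct values of each label. The following sketch computes that bound and flags label names that look like identifiers. The helper names, the hint list, and the example label values are assumptions for illustration.

```python
from math import prod

# Label names that usually indicate unbounded values (assumed list).
HIGH_CARDINALITY_HINTS = frozenset({"user_id", "request_id", "session_id"})


def max_series_count(label_values):
    """Upper bound on distinct series: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


def risky_labels(label_names):
    """Return label names that look like per-user or per-request identifiers."""
    return [name for name in label_names if name.lower() in HIGH_CARDINALITY_HINTS]


# Hypothetical label values for a counter with three labels.
bounded = max_series_count({
    "topic": ["files", "messages", "calendars"],
    "event": ["created", "updated", "deleted", "synced"],
    "result": ["success", "error"],
})
flagged = risky_labels(["topic", "user_id", "result"])
```

Three small labels stay at 24 series here, while adding a `user_id` label would multiply that by the number of users.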

Troubleshooting

Traces Not Appearing

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No traces at all | OTEL disabled | Check that OTEL_SDK_DISABLED is not set to true |
| No traces at all | Wrong endpoint | Verify that OTEL_EXPORTER_OTLP_ENDPOINT is reachable from the service |
| No traces at all | Sampling disabled | Check that OTEL_TRACES_SAMPLER is not always_off |
| Traces disconnected | Context not propagated | Ensure services pass trace headers (traceparent) in HTTP requests |
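
The first three checks in the table can be scripted. The sketch below inspects a mapping of environment variables and performs a plain TCP connection test against the endpoint. Both helper names are hypothetical, and a successful TCP connection only shows that something is listening, not that the collector accepts OTLP data.

```python
import socket
from urllib.parse import urlsplit


def diagnose_trace_config(env):
    """Return a list of likely reasons why no traces are exported."""
    findings = []
    if env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true":
        findings.append("OTEL_SDK_DISABLED is true, so the SDK is disabled")
    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip():
        findings.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    if env.get("OTEL_TRACES_SAMPLER", "").strip().lower() == "always_off":
        findings.append("OTEL_TRACES_SAMPLER is always_off, so every trace is dropped")
    return findings


def endpoint_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(endpoint)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


findings = diagnose_trace_config({
    "OTEL_SERVICE_NAME": "api",
    "OTEL_SDK_DISABLED": "true",
    "OTEL_TRACES_SAMPLER": "always_off",
})
```

In a running service, passing `os.environ` to `diagnose_trace_config` gives the same checks against the live configuration.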

Missing Span Context in Hatchet Workflows

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| Workflow spans not linked | Instrumentation not loaded | Verify that @kindo/instrumentation-hatchet is registered |
| Parent spans missing | Metadata not propagated | Check that workflow options include trace context |

Metrics Not Exporting

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No metrics in Grafana | Exporter disabled | Set OTEL_METRICS_EXPORTER=otlp |
| Delayed metrics | Long export interval | Reduce OTEL_METRIC_EXPORT_INTERVAL (value in milliseconds) |
| Partial metrics | Collector not configured | Verify that the collector accepts metrics on the configured endpoint |

High Memory Usage

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| OOM errors | Large span queue | Reduce OTEL_BSP_MAX_QUEUE_SIZE |
| Gradual memory growth | Too many unique metrics | Review metric cardinality, reduce unique label values |

Log Correlation Issues

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| No trace IDs in logs | Wrong logger | Ensure the service uses the @kindo/observability logger |
| Intermittent trace IDs | Context lost | Check that async operations preserve the OTEL context |

Quick Reference

Minimum Required Configuration

# Required for all Node.js services
OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Recommended Production Configuration

OTEL_SERVICE_NAME=your-service-name
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.0.0
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.5
OTEL_BSP_MAX_QUEUE_SIZE=2048
OTEL_BSP_SCHEDULE_DELAY=5000

Development Configuration (OTEL Disabled)

OTEL_SDK_DISABLED=true

Development Configuration (Console Export for Debugging)

OTEL_SERVICE_NAME=your-service-name
OTEL_TRACES_EXPORTER=console
OTEL_METRICS_EXPORTER=console
OTEL_LOGS_EXPORTER=console

Production Checklist

OTEL_SERVICE_NAME set uniquely for each service
OTEL_EXPORTER_OTLP_ENDPOINT points to collector
OTEL_RESOURCE_ATTRIBUTES includes environment and version
Sampling configured appropriately for traffic volume
OTEL Collector configured to receive and export telemetry
Grafana/Jaeger dashboards configured for visualization
Alerts configured for critical metrics

This document is maintained alongside the codebase. Update when adding new services, metrics, or changing observability configuration.
