AWS Infrastructure Deployment Guide


This guide walks you through deploying the Kindo infrastructure using the kindo-infra module.

Table of Contents

  1. Quick Start

  2. Understanding terraform.tfvars

  3. Core Configuration Parameters

  4. Deployment Sizing

  5. Network Configuration

  6. Feature Toggles

  7. Advanced Configuration

  8. Terraform Deployment Process

  9. Post-Deployment Verification

  10. Troubleshooting

Quick Start

1. Set Up Your Deployment Directory

# Clone the Kindo modules repository
git clone https://github.com/your-org/kindo-modules.git
cd kindo-modules/examples/kindo-infra-aws-example
# Copy the example configuration
cp terraform.tfvars.example terraform.tfvars
# Edit your configuration
vi terraform.tfvars

2. Minimal Configuration

Here’s the absolute minimum you need to configure in terraform.tfvars:

# Core configuration (REQUIRED)
project_name = "mycompany"          # Your project identifier
environment  = "production"         # Environment name (dev/staging/production)
aws_region   = "us-west-2"          # AWS region for deployment

# Deployment sizing (REQUIRED)
deployment_size = "small"            # Choose: dev, small, medium, large, xlarge

# Network configuration (REQUIRED)
vpc_cidr = "10.0.0.0/16"            # Must not conflict with existing networks

# DNS configuration (REQUIRED)
domain_name = "kindo.mycompany.com" # Your domain for the application

3. Deploy

# Set your AWS profile
export AWS_PROFILE=your-aws-profile
# Initialize Terraform
terraform init
# Review the plan
terraform plan
# Deploy
terraform apply

Understanding terraform.tfvars

The terraform.tfvars file is your primary configuration interface. Each parameter controls specific aspects of your infrastructure deployment. Let’s explore each section in detail. Terraform loads terraform.tfvars automatically and uses it to assign values to the variables declared in variables.tf, so you normally only need to edit this one file.

Core Configuration Parameters

Project Identification

project_name = "mycompany"
environment  = "production"

What these do:

  • project_name: Creates a namespace for all resources. Used in resource naming (e.g., mycompany-production-eks-cluster)

  • environment: Distinguishes between deployments (dev/staging/production). Affects resource naming and default configurations

Best practices:

  • Use lowercase, no spaces or special characters

  • Keep project_name consistent across all environments

  • Use standard environment names: dev, staging, production

AWS Region

aws_region = "us-west-2"

What this does:

  • Determines where all resources are deployed

  • Affects latency for your users

  • Influences service availability and pricing

How to choose:

  • Pick a region close to your users

  • Ensure all required services are available (EKS, RDS, ElastiCache, MQ)

  • Consider data residency requirements

  • Check AWS pricing for your chosen region

Deployment Sizing

T-Shirt Sizing Model

deployment_size = "small"
production_mode = false

Available sizes and what they configure:

Size   | Best For            | Resources                                                  | Monthly Cost
------ | ------------------- | ---------------------------------------------------------- | ------------
dev    | Development/Testing | 1-3 t3.medium nodes, db.t3.micro RDS, no HA                | ~$150-200
small  | Small Production    | 2-5 t3.large nodes, db.t3.small RDS, optional HA           | ~$400-500
medium | Standard Production | 3-8 t3.xlarge nodes, db.t3.medium RDS, HA enabled          | ~$800-1000
large  | High-Traffic        | 5-15 m5.xlarge nodes, db.m5.large RDS, full HA             | ~$2000-3000
xlarge | Enterprise          | 10-30 m5.2xlarge nodes, db.m5.xlarge RDS, full HA + backup | ~$5000+

What production_mode does:

  • true: Enables deletion protection on RDS, prevents accidental terraform destroy

  • false: Allows easy cleanup for development environments

Overriding Size Defaults

If the t-shirt sizes don’t fit perfectly, you can override specific settings:

deployment_size = "small"

# Override specific node group settings
general_min_size       = 3  # Instead of default 2 for small
general_max_size       = 10 # Instead of default 5 for small
general_instance_types = ["t3.xlarge"] # Instead of t3.large

# Override database settings
postgres_main_instance_class = "db.t3.medium" # Instead of db.t3.small
postgres_main_allocated_storage = 50 # Instead of 20GB

Network Configuration

VPC CIDR

vpc_cidr = "10.0.0.0/16"

What this does:

  • Creates a VPC with the specified IP range

  • Automatically creates subnets across 3 availability zones

  • Must not conflict with:

    • Other VPCs in your account

    • On-premises networks (if using VPN/Direct Connect)

    • Peered VPCs

Subnet allocation (automatic):

10.0.0.0/16 creates:
  Public:   10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20  (3 AZs)
  Private:  10.0.48.0/20, 10.0.64.0/20, 10.0.80.0/20 (3 AZs)
  Database: Created within private subnets

Availability Zones

# Usually not needed - auto-detected
# availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]

What this does:

  • Specifies which AZs to use for high availability

  • Default: Automatically uses first 3 AZs in your region

  • Only override if you have specific AZ requirements
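
If you do override, confirm which zones are actually available in your region first. A quick check with the AWS CLI (assumes your credentials and region are already configured):

aws ec2 describe-availability-zones \
  --region us-west-2 \
  --query 'AvailabilityZones[?State==`available`].ZoneName' \
  --output text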

Feature Toggles

DNS and SSL Configuration

domain_name        = "kindo.mycompany.com"
create_public_zone = true

What these do:

  • domain_name: Base domain for your application

    • Creates subdomains: app.kindo.mycompany.com, api.kindo.mycompany.com

    • Used for SSL certificate generation

  • create_public_zone:

    • true: Creates Route53 hosted zone (you’ll need to update NS records)

    • false: Use existing Route53 zone or external DNS
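
If you set create_public_zone = false, make sure a hosted zone for your domain already exists before deploying. A quick check, using the example domain from above (substitute your own):

aws route53 list-hosted-zones-by-name \
  --dns-name kindo.mycompany.com \
  --query 'HostedZones[0].[Name,Id]' \
  --output text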

Audit Logging

enable_syslog = true

What this does:

enable_syslog:

  • true: Deploys managed audit log infrastructure:

    • EC2 instances with rsyslog for audit log collection

    • Kinesis Firehose for secure audit log streaming

    • S3 storage with lifecycle policies for compliance

    • Network Load Balancer providing a syslog endpoint for applications

    • Used by Kindo applications to store audit logs (user actions, security events)

  • false: You must provide an external syslog endpoint for audit logging

Important: This is a critical compliance feature. Audit logs track all user actions and security events within the Kindo platform.

Monitoring and Observability

enable_monitoring = true

What this does:

enable_monitoring:

  • true: Deploys ADOT (AWS Distro for OpenTelemetry):

    • Collects application metrics and traces

    • Sends to AWS X-Ray and CloudWatch

    • Configures Container Insights for cluster monitoring

    • Provides performance and operational visibility

  • false: Use your existing monitoring solution (Datadog, New Relic, etc.)

Email Service

enable_ses = true
# If false, configure SMTP settings in kindo-secrets module

What this does:

  • true: Configures Amazon SES for email:

    • Creates SES domain identity

    • Sets up DKIM verification

    • Configures IAM permissions

  • false: You’ll provide SMTP configuration later

VPN Access

enable_vpn = false
# vpn_client_cidr_block = "10.100.0.0/16"  # If enabled

What this does:

  • true: Creates AWS Client VPN endpoint:

    • Secure access to private resources

    • Requires certificate generation

    • Additional cost (~$0.10/hour + connection fees)

  • false: Access via public endpoints only
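
If you do enable the VPN, you can confirm the endpoint came up after the apply. A minimal check (assumes the AWS CLI is configured for the deployment account):

aws ec2 describe-client-vpn-endpoints \
  --query 'ClientVpnEndpoints[*].[ClientVpnEndpointId,Status.Code]' \
  --output table \
  --region $(terraform output -raw aws_region)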

Additional Node Groups

memory_node_group_enabled  = true
compute_node_group_enabled = false

What these do:

  • memory_node_group_enabled: Adds memory-optimized nodes (r6i family)

    • For memory-intensive workloads

    • Uses ONDEMAND instances by default

  • compute_node_group_enabled: Adds compute-optimized nodes (c6i family)

    • For CPU-intensive workloads

    • Uses ONDEMAND instances by default
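
Once the cluster is up, you can confirm the extra node groups joined by listing nodes with their instance types (node.kubernetes.io/instance-type is a standard Kubernetes node label):

# Memory-optimized nodes show r6i.* types, compute-optimized show c6i.* types
kubectl get nodes -L node.kubernetes.io/instance-type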

Advanced Configuration

Kubernetes Version

kubernetes_version = "1.32"

Considerations:

  • Use latest stable version for new deployments

  • Check compatibility with your applications

  • Plan for regular upgrades (every 3-4 months)

PostgreSQL Configuration

# Database versions
postgres_engine_version = "17.4"

# User management
manage_postgres_users = true  # Auto-provision database users
# skip_postgres_resource_destroy = true  # Helps with clean destroy

# Performance tuning
postgres_main_max_connections = 200
postgres_auxiliary_max_connections = 100

# Storage encryption
postgres_storage_encrypted = true

# Backup settings
postgres_backup_retention_period = 7
postgres_backup_window = "03:00-04:00"
postgres_maintenance_window = "sun:04:00-sun:05:00"

# High Availability
postgres_main_multi_az = true       # For production
postgres_auxiliary_multi_az = false # Can be false for cost savings

Database User Provisioning: When manage_postgres_users = true, the module automatically:

  • Creates dedicated databases and users for each application:

    • Main RDS: kindo database and user for the main application

    • Auxiliary RDS:

      • unleash database and user for feature flags

      • litellm database and user for LLM proxy

      • nango database and user for integrations

  • Generates secure passwords stored in AWS Secrets Manager (see the retrieval example below)

  • Sets appropriate permissions for each user

  • Uses ephemeral SSM tunnels for secure provisioning
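
Once provisioning completes, the generated credentials can be read back from Secrets Manager. A sketch, assuming a secret name prefix along the lines of kindo-<environment>-postgres (check the Terraform outputs or the Secrets Manager console for the exact names in your deployment):

# List provisioned Postgres secrets (hypothetical name prefix - adjust to your deployment)
aws secretsmanager list-secrets \
  --query "SecretList[?starts_with(Name, 'kindo-production-postgres')].Name" \
  --output text
# Retrieve one credential (replace the secret ID with a name from the list above)
aws secretsmanager get-secret-value \
  --secret-id kindo-production-postgres-litellm \
  --query SecretString \
  --output text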

Redis Configuration

# Redis version
redis_engine_version = "7.0"

# Node type (affects memory and performance)
redis_node_type = "cache.t3.small"

# Cluster mode
redis_number_cache_clusters = 2  # For HA
redis_automatic_failover_enabled = true

# Encryption
redis_transit_encryption_enabled = true
redis_at_rest_encryption_enabled = true

RabbitMQ Configuration

# Version
rabbitmq_engine_version = "3.13"

# Instance type
rabbitmq_instance_type = "mq.t3.micro"

# Deployment mode
rabbitmq_deployment_mode = "SINGLE_INSTANCE"  # or "CLUSTER_MULTI_AZ" for HA

Storage Configuration

# S3 bucket names (auto-generated if not specified)
# s3_uploads_bucket_name = "mycompany-kindo-uploads"
# s3_audit_logs_bucket_name = "mycompany-kindo-audit"

# Audit log retention (compliance requirement)
syslog_log_retention_days = 365  # Minimum 1 year for compliance
# syslog_s3_transition_ia_days = 30      # Move to Infrequent Access after 30 days
# syslog_s3_transition_glacier_days = 90 # Move to Glacier after 90 days

Security and Compliance

# Deletion protection
postgres_deletion_protection = true  # Prevent accidental deletion
postgres_skip_final_snapshot = false # Take snapshot before deletion

# Network security
cluster_endpoint_public_access = true   # For development
cluster_endpoint_private_access = true  # Always recommended

# Encryption
enable_s3_encryption = true
kms_key_deletion_window_in_days = 30

Resource Tags

tags = {
  Owner       = "platform-team"
  CostCenter  = "engineering"
  Environment = "production"
  Compliance  = "sox"
}

Best practices for tags:

  • Use for cost allocation

  • Include compliance markers

  • Add contact information

  • Keep consistent across resources
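
Consistent tags also make post-deployment audits straightforward. For example, the Resource Groups Tagging API can list everything carrying your tags (the keys here match the example tags above):

aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=Owner,Values=platform-team Key=Environment,Values=production \
  --query 'ResourceTagMappingList[*].ResourceARN' \
  --output text \
  --region us-west-2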

Terraform Deployment Process

1. Pre-Deployment Checklist

Before running Terraform:

  • [ ] AWS credentials configured (aws sts get-caller-identity)

  • [ ] VPC CIDR doesn’t conflict with existing networks

  • [ ] Domain name is available

  • [ ] Required service quotas are available

  • [ ] terraform.tfvars is complete
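
The first two checks can be run from the CLI before terraform init, for example:

# Confirm which AWS identity Terraform will use
aws sts get-caller-identity
# List CIDRs of existing VPCs in the target region to spot conflicts
aws ec2 describe-vpcs \
  --query 'Vpcs[*].CidrBlock' \
  --output text \
  --region us-west-2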

2. Initialize Terraform

cd examples/kindo-infra-aws-example
# Initialize Terraform
terraform init
# Format configuration files
terraform fmt
# Validate configuration
terraform validate

3. Review the Terraform Plan

# Generate and review the plan
terraform plan -out=tfplan
# For a specific deployment size
terraform plan -var="deployment_size=medium" -out=tfplan
# Save plan for review
terraform show -no-color tfplan > plan.txt

4. Deploy Infrastructure

# Apply the configuration
terraform apply tfplan
# Or directly (will prompt for confirmation)
terraform apply
# For automated deployments (CI/CD)
terraform apply -auto-approve

5. Monitor Deployment

The deployment typically takes:

  • VPC and networking: 2-3 minutes

  • EKS cluster: 10-15 minutes

  • RDS instances: 10-15 minutes

  • Total: 25-35 minutes

Watch for:

  • EKS cluster becoming ACTIVE

  • RDS instances becoming available

  • Load balancers becoming active
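
You can poll the major components from a second terminal while the apply runs. For example, assuming the resource naming pattern described earlier (mycompany-production-eks-cluster); adjust to your project and environment:

# EKS cluster status (ACTIVE when ready)
aws eks describe-cluster \
  --name mycompany-production-eks-cluster \
  --query cluster.status \
  --region us-west-2
# RDS instance status ("available" when ready)
aws rds describe-db-instances \
  --query 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus]' \
  --output table \
  --region us-west-2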

Post-Deployment Verification

1. Review Critical Outputs

# Display all outputs
terraform output
# Save outputs for next steps
terraform output -json > infrastructure-outputs.json

Key Outputs and Their Purpose:

Output                       | Purpose                    | Action Required
---------------------------- | -------------------------- | ----------------------------------------
route53_name_servers         | NS records for your domain | ⚠️ ACTION: Update your domain registrar
ses_verification_token       | SES domain verification    | ⚠️ ACTION: May need to verify domain
syslog_endpoint              | Audit log endpoint         | Note for application configuration
eks_cluster_name             | Cluster identifier         | Used for kubectl configuration
postgres_*_connection_string | Database connections       | Automatically used by secrets module
redis_connection_string      | Cache connection           | Automatically used by secrets module
rabbitmq_connection_string   | Message queue              | Automatically used by secrets module
vpc_id                       | Network identifier         | May need for additional resources
alb_dns_name                 | Load balancer endpoint     | Will be mapped to your domain
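
If you saved the outputs to infrastructure-outputs.json as shown above, individual values can be read back with jq, for example:

# Read single values from the saved outputs (requires jq)
jq -r '.eks_cluster_name.value' infrastructure-outputs.json
jq -r '.route53_name_servers.value[]' infrastructure-outputs.json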

2. Required Post-Deployment Actions

DNS Delegation (if create_public_zone = true)

⚠️ IMMEDIATE ACTION REQUIRED:

# Get the Route53 name servers
terraform output route53_name_servers
# Example output:
# [
#   "ns-123.awsdns-12.com",
#   "ns-456.awsdns-34.net",
#   "ns-789.awsdns-56.org",
#   "ns-012.awsdns-78.co.uk"
# ]

What to do:

  1. Log into your domain registrar (GoDaddy, Namecheap, etc.)

  2. Find DNS/Nameserver settings for your domain

  3. Replace existing nameservers with the AWS Route53 nameservers above

  4. Wait for DNS propagation (5-30 minutes typically)

Verify DNS delegation:

# Check if DNS is properly delegated
dig +short NS $(terraform output -raw domain_name)
# Should return the Route53 name servers

Amazon SES Setup (if enable_ses = true)

⚠️ ACTION MAY BE REQUIRED:

# Check SES verification status
aws ses get-identity-verification-attributes \
  --identities $(terraform output -raw domain_name) \
  --region $(terraform output -raw aws_region)
# Check if account is in sandbox mode
aws ses describe-configuration-set \
  --configuration-set-name default \
  --region $(terraform output -raw aws_region) 2>/dev/null || \
  echo "⚠️  SES is likely in sandbox mode"

If SES is in sandbox mode:

  1. For testing: Add verified email addresses

aws ses verify-email-identity \
  --email-address [email protected] \
  --region $(terraform output -raw aws_region)

  2. For production: Request production access

  • Open AWS Support Center

  • Create case: “Service limit increase”

  • Service: “Simple Email Service (SES)”

  • Limit type: “Sending Quota”

  • Provide use case and expected volume

  • Typical approval time: 24-48 hours

3. Configure kubectl Access

# Update kubeconfig
aws eks update-kubeconfig \
  --region $(terraform output -raw aws_region) \
  --name $(terraform output -raw eks_cluster_name) \
  --profile ${AWS_PROFILE}
# Verify cluster access
kubectl get nodes
kubectl get pods -A

Expected output:

  • 2-5 nodes in Ready state (based on deployment_size)

  • Core system pods running in kube-system namespace

4. Verify Database Provisioning

If manage_postgres_users = true:

# Check created databases
echo "Main RDS databases:"
terraform output postgres_main_databases
echo "Auxiliary RDS databases:"
terraform output postgres_auxiliary_databases
# Verify secrets were created
aws secretsmanager list-secrets \
  --filters "Key=name,Values=kindo-${ENVIRONMENT}-postgres" \
  --query 'SecretList[*].Name' \
  --region $(terraform output -raw aws_region)

Expected databases:

  • Main RDS: kindo (main application database)

  • Auxiliary RDS: unleash, litellm, nango

5. Verify Audit Logging Setup

If enable_syslog = true:

# Get syslog endpoint
SYSLOG_ENDPOINT=$(terraform output -raw syslog_endpoint)
echo "Syslog endpoint: $SYSLOG_ENDPOINT"
# Test syslog connectivity (from a machine with network access)
echo "<14>$(date '+%b %d %H:%M:%S') test-host kindo-test: Test audit log" | \
  nc -u -w1 $SYSLOG_ENDPOINT 514
# Check if logs are being stored in S3
aws s3 ls s3://$(terraform output -raw audit_logs_bucket_name)/syslog/ \
  --region $(terraform output -raw aws_region)

6. Verify Monitoring Setup

If enable_monitoring = true:

# Check ADOT addon status
kubectl get pods -n opentelemetry-operator-system
# Verify X-Ray service map (after applications are deployed)
aws xray get-service-graph \
  --start-time $(date -u -d '5 minutes ago' +%s) \
  --end-time $(date +%s) \
  --region $(terraform output -raw aws_region)

7. Security Verification

# Verify encryption is enabled
echo "RDS Encryption Status:"
aws rds describe-db-instances \
  --query 'DBInstances[*].[DBInstanceIdentifier,StorageEncrypted]' \
  --output table \
  --region $(terraform output -raw aws_region)
echo "Redis Encryption Status:"
aws elasticache describe-replication-groups \
  --query 'ReplicationGroups[*].[ReplicationGroupId,TransitEncryptionEnabled,AtRestEncryptionEnabled]' \
  --output table \
  --region $(terraform output -raw aws_region)

8. Document Important Endpoints

Create a deployment summary for your team:

cat > deployment-summary.md << EOF
# Kindo Infrastructure Deployment Summary

## Deployment Information
- **Date**: $(date)
- **Environment**: $(terraform output -raw environment)
- **Region**: $(terraform output -raw aws_region)
- **Deployment Size**: $(terraform output -raw deployment_size)

## Critical Endpoints
- **EKS Cluster**: $(terraform output -raw eks_cluster_name)
- **Domain**: $(terraform output -raw domain_name)
- **Syslog Endpoint**: $(terraform output -raw syslog_endpoint)

## Database Information
- **Main RDS**: $(terraform output -raw postgres_main_endpoint)
- **Auxiliary RDS**: $(terraform output -raw postgres_auxiliary_endpoint)

## Action Items
- [ ] DNS nameservers updated at registrar
- [ ] SES production access requested (if needed)
- [ ] Team members granted kubectl access
- [ ] Monitoring dashboards configured

## Next Steps
1. Deploy secrets using kindo-secrets module
2. Deploy applications using kindo-applications module
3. Configure peripherals using kindo-peripheries module
EOF

echo "✅ Deployment summary saved to deployment-summary.md"

Troubleshooting

Common Issues and Solutions

1. VPC CIDR Conflicts

Error: “The CIDR ‘10.0.0.0/16’ conflicts with a CIDR of another VPC”

Solution:

# Use a different CIDR range
vpc_cidr = "10.1.0.0/16"  # or "172.16.0.0/16"

2. Service Quotas Exceeded

Error: “You have exceeded the maximum number of DBInstances”

Solution:

# Check current quotas
aws service-quotas get-service-quota \
  --service-code rds \
  --quota-code L-952B80B9
# Request increase via AWS Console or CLI

3. EKS Addon Version Incompatibility

Error: “Addon version is not compatible with cluster version”

Solution:

# List compatible addon versions
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.32 \
  --query 'addons[0].addonVersions[*].addonVersion'

4. PostgreSQL Connection Issues

Error: “PostgreSQL tunnel failed to start”

Solution:

# Option 1: Skip user management during destroy
skip_postgres_resource_destroy = true

# Option 2: Ensure VPC endpoints are created
create_vpc_endpoints = true

5. Insufficient Node Capacity

Error: “Pod failed to schedule: Insufficient CPU/memory”

Solution:

# Increase node group size
general_min_size = 3
general_max_size = 10

# Or add specialized node groups
memory_node_group_enabled = true
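
To confirm capacity is actually the problem before resizing, inspect the pending pods and each node's allocated resources:

# Show pods stuck in Pending
kubectl get pods -A --field-selector=status.phase=Pending
# See the scheduler's reason for a specific pod
kubectl describe pod <pending-pod> -n <namespace> | grep -A5 Events
# Check how much CPU/memory is already allocated per node
kubectl describe nodes | grep -A7 "Allocated resources"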

Debugging Commands

# Check Terraform state
terraform state list
terraform state show <resource>
# Force resource recreation (Terraform 0.15.2+; `taint` is deprecated in newer releases)
terraform apply -replace=<resource>
# Or, on older Terraform versions:
terraform taint <resource>
terraform apply
# Clean up failed resources
terraform destroy -target=<resource>
# Enable debug logging
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log  # optional: write the debug log to a file
terraform apply

Getting Help

If you encounter issues:

  1. Check the troubleshooting guide

  2. Review CloudWatch logs

  3. Check AWS CloudTrail for API errors

  4. Contact support with:

    • Terraform version (terraform version)

    • Error messages

    • terraform.tfvars (sanitized)

    • Terraform state list

Next Steps

After successful infrastructure deployment:

  1. Deploy Secrets: Follow the Secrets Configuration Guide to generate and deploy application secrets

  2. Deploy Applications: Use the Application Deployment Guide to deploy Kindo applications

  3. Configure Peripherals: Set up additional services using the Peripherals Configuration Guide

  4. Set Up Monitoring: Configure observability using the Monitoring Setup Guide

Appendix: Complete terraform.tfvars Reference

For reference, here’s a complete terraform.tfvars with all available options:

# ===============================================================
# Complete terraform.tfvars Reference
# ===============================================================

# Core Configuration (REQUIRED)
project_name = "mycompany"
environment  = "production"
aws_region   = "us-west-2"

# Deployment Sizing (REQUIRED)
deployment_size = "medium"  # dev, small, medium, large, xlarge
production_mode = true      # Enable deletion protection

# Network Configuration (REQUIRED)
vpc_cidr = "10.0.0.0/16"
# availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"] # Auto-detected

# DNS Configuration (REQUIRED)
domain_name        = "kindo.mycompany.com"
create_public_zone = true

# Feature Toggles
enable_syslog     = true
enable_monitoring = true
enable_ses        = true
enable_vpn        = false

# Additional Node Groups
memory_node_group_enabled  = true
compute_node_group_enabled = false

# Kubernetes Configuration
kubernetes_version = "1.32"
cluster_endpoint_public_access  = true
cluster_endpoint_private_access = true

# Database Configuration
postgres_engine_version = "17.4"
postgres_deletion_protection = true
postgres_skip_final_snapshot = false
postgres_backup_retention_period = 7
postgres_backup_window = "03:00-04:00"
postgres_maintenance_window = "sun:04:00-sun:05:00"

# Override size defaults (optional)
# general_min_size = 3
# general_max_size = 10
# general_instance_types = ["t3.xlarge"]
# postgres_main_instance_class = "db.t3.medium"
# postgres_main_allocated_storage = 50
# redis_node_type = "cache.t3.medium"

# Advanced Options
create_vpc_endpoints = true
skip_postgres_resource_destroy = true

# Resource Tags
tags = {
  Owner       = "platform-team"
  CostCenter  = "engineering"
  Environment = "production"
}

Remember: Start with the minimal configuration and add parameters as needed. The t-shirt sizing handles most defaults appropriately for your deployment size.