Terraform at Scale: Lessons from Managing 500+ Resources

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

When Terraform Gets Slow

Our Terraform state file grew to 500+ resources. Plans took 8 minutes. Applies timed out. State locking conflicts were daily. Something had to change.

Here's how we tamed it.

Problem 1: Monolithic State

Everything was in one state file. VPCs, databases, Kubernetes clusters, DNS, IAM — all in one giant blob.

Before: 1 state file, 500+ resources
  terraform plan: 8 minutes
  terraform apply: timeout risk
  blast radius: everything

Solution: State Decomposition

infrastructure/
├── network/          # VPCs, subnets, security groups
├── data/             # RDS, ElastiCache, S3
├── compute/          # EKS, ASGs, Launch templates
├── dns/              # Route53 zones and records
├── iam/              # Roles, policies, users
└── monitoring/       # CloudWatch, SNS topics

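Each directory has its own state file. To reference values across boundaries, read the other state's outputs with the terraform_remote_state data source. Only values the source configuration declares as outputs are visible: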

# compute/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

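resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity

  vpc_config {
    # Subnet IDs come from the network state's outputs
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}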
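
For the lookup above to resolve, the network configuration has to export the subnet IDs as a root-level output. A minimal sketch, assuming the private subnets are declared with count under the hypothetical name aws_subnet.private (the real resource names in the network directory may differ):

```hcl
# network/outputs.tf
# Assumption: subnets are declared as aws_subnet.private using count
# (a hypothetical name). Only root-level outputs are visible to
# terraform_remote_state in other states.
output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by compute/"
  value       = aws_subnet.private[*].id
}
```

Keeping the output list small and deliberate turns it into the contract between states: anything not exported stays private to the network directory.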

Result: 6 state files, 60-100 resources each. Plan time: 45 seconds.

Problem 2: Environment Drift

Dev, staging, and prod drifted constantly because each environment's configuration was copy-pasted and then edited independently.

Solution: Modules + Terragrunt

modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds-instance/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/
│   └── terragrunt.hcl
├── staging/
│   └── terragrunt.hcl
└── prod/
    └── terragrunt.hcl

Each environment points at the same module and passes only the inputs that differ:

# environments/prod/terragrunt.hcl
terraform {
  source = "../../modules/eks-cluster"
}

inputs = {
  cluster_name    = "prod-main"
  node_count      = 10
  instance_type   = "m5.2xlarge"
  multi_az        = true
}
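
Terragrunt hands each entry in inputs to Terraform as a TF_VAR_ environment variable, so the module needs a matching variable declaration for every key. A sketch of what that could look like for the prod inputs above; the types, descriptions, and the validation rule are assumptions inferred from those values, not the module's actual definitions:

```hcl
# modules/eks-cluster/variables.tf
# Sketch only: types and the validation rule are assumptions inferred
# from the prod inputs, not the module's real interface.

variable "cluster_name" {
  description = "Name of the EKS cluster, e.g. prod-main"
  type        = string
}

variable "node_count" {
  description = "Number of worker nodes"
  type        = number

  validation {
    condition     = var.node_count >= 1
    error_message = "The node_count value must be at least 1."
  }
}

variable "instance_type" {
  description = "EC2 instance type for the worker nodes"
  type        = string
}

variable "multi_az" {
  description = "Whether to spread nodes across availability zones"
  type        = bool
}
```

Leaving out defaults forces every environment to state its values explicitly, so a difference between dev and prod shows up in the terragrunt.hcl files rather than hiding inside the module.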

Problem 3: Dangerous Applies

Anyone could run terraform apply against production from their laptop.

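Solution: CI/CD Only

Every pull request gets a plan posted as a comment. Apply runs only on main, behind an environment approval gate.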

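# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

permissions:
  contents: read
  pull-requests: write  # Needed to post the plan comment

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # Keep stdout clean for the JSON redirect below
      - run: terraform init
      - run: terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # Save the plan so the apply job runs exactly what was planned.
      # Plan files can contain sensitive values; restrict artifact access accordingly.
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # Post plan as PR comment (push events have no PR to comment on)
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const plan = require('./plan.json');
            const rc = plan.resource_changes ?? [];
            const count = (action) => rc.filter(c => c.change.actions.includes(action)).length;
            const adds = count('create');
            const changes = count('update');
            const deletes = count('delete');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n+${adds} ~${changes} -${deletes}\n\n${deletes > 0 ? '⚠ RESOURCES WILL BE DESTROYED' : ''}`
            });

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Requires approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      # Fetch the plan produced by the plan job
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply plan.tfplan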

Problem 4: State Locks

Multiple engineers running plan simultaneously caused state lock conflicts.

Solution: Remote State with DynamoDB Locking

terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
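
The backend expects the lock table to exist already, with a string partition key named LockID. A minimal sketch of that table, assuming it lives in a small bootstrap configuration of its own; the resource label and billing mode are illustrative choices, not taken from the original setup. Newer Terraform releases also offer S3-native locking through the backend's use_lockfile argument, but this sketch stays with the DynamoDB approach described here.

```hcl
# Assumption: managed in a separate bootstrap configuration, since the
# table must exist before any backend that references it is initialized.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST" # lock traffic is tiny, so on-demand is enough
  hash_key     = "LockID"          # the S3 backend requires this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

One table can serve every state file, because each lock item is keyed by the bucket and state path.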

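Plus: only CI/CD runs apply. Humans run plan locally with -lock=false for quick checks. That flag skips lock acquisition, so a local plan never blocks the pipeline, but it can read state while an apply is in progress, so treat its output as advisory.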

Results

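| Metric                    | Before  | After            |
|---------------------------|---------|------------------|
| Plan time                 | 8 min   | 45 sec           |
| Apply failures            | 3/week  | 0.5/week         |
| State conflicts           | Daily   | Never            |
| Env drift incidents       | Monthly | None in 6 months |
| Time to provision new env | 2 days  | 30 minutes       |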

If you want AI-powered infrastructure management that catches drift before it causes outages, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
