Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

The Runbook Nobody Reads

We had runbooks. Beautiful, detailed Google Docs runbooks, 47 pages long. Nobody read them at 3 AM.

The problem isn't the documentation. The problem is expecting a sleep-deprived human to follow a 47-step procedure correctly.

The Automation Ladder

I think about runbook automation as a ladder:

Level 0: No runbook (tribal knowledge)
Level 1: Written runbook (Google Doc)
Level 2: Structured runbook (checklist format)
Level 3: Semi-automated (scripts for each step)
Level 4: Fully automated (one-click remediation)
Level 5: Self-healing (no human needed)

Most teams are at Level 1-2. The goal is Level 4-5 for your ten highest-impact incident types.

Identifying Automation Candidates

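Not everything should be automated. Start with high-frequency, well-understood procedures. A query like this ranks root-cause categories by total time lost (frequency times average resolution time):

-- Query your incident database (PostgreSQL syntax; adjust table and column names to your schema)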
SELECT 
  root_cause_category,
  COUNT(*) as frequency,
  AVG(resolution_time_minutes) as avg_mttr,
  COUNT(*) * AVG(resolution_time_minutes) as total_impact_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY root_cause_category
ORDER BY total_impact_minutes DESC
LIMIT 10;

For us, the top 5 were:

  1. Disk full on log volumes (2x/week)
  2. Memory leak requiring pod restart (1x/week)
  3. Certificate expiry (1x/month, but high impact)
  4. Database connection pool exhaustion (1x/week)
  5. Stuck deployment (2x/week)
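
If your incident history lives in a CSV export rather than a SQL database, the same ranking can be sketched with awk. This is a minimal sketch, not the author's tooling: the function name rank_incidents and the two-column "category,resolution_minutes" layout are assumptions, and the sample rows are made up for illustration.

```shell
#!/bin/bash
# rank_incidents: read "category,resolution_minutes" rows on stdin and print
# "category,frequency,avg_mttr,total_minutes", highest total impact first.
# The two-column CSV layout is an assumption for this sketch.
rank_incidents() {
  awk -F',' '
    # Skip headers and malformed rows: keep only rows with a numeric duration.
    NF == 2 && $2 ~ /^[0-9]+$/ { count[$1]++; total[$1] += $2 }
    END {
      for (c in count)
        printf "%s,%d,%.1f,%d\n", c, count[c], total[c] / count[c], total[c]
    }
  ' | sort -t',' -k4,4nr | head -n 10
}

# Example with made-up rows:
printf '%s\n' 'disk_full,25' 'disk_full,35' 'cert_expiry,45' 'stuck_deploy,30' | rank_incidents
```

Sorting on the total rather than on frequency alone keeps rare but expensive incidents, such as certificate expiry, on the list.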

Example: Disk Full Auto-Remediation

Before (Level 1 — runbook):

1. SSH to the affected host
2. Run df -h to confirm
3. Check /var/log for large files
4. Run logrotate manually
5. If still full, find and remove old files
6. If still full, expand the volume
7. Verify service recovered

After (Level 5 — self-healing):

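#!/bin/bash
# disk-remediation.sh: triggered by a monitoring alert.
# Usage: disk-remediation.sh <host>
# notify_slack and page_oncall are not defined here; they are assumed to be
# provided by your alerting tooling and sourced before this script runs.

HOST=${1:?usage: disk-remediation.sh <host>}
THRESHOLD=90

# Print the usage percentage of the /var/log filesystem as a bare integer.
disk_usage() {
  ssh "$HOST" "df --output=pcent /var/log | tail -n 1 | tr -dc '0-9'"
}

USAGE=$(disk_usage)

# An empty value means ssh or df failed; escalate rather than guess.
if [ -z "$USAGE" ]; then
  page_oncall "Disk check failed on ${HOST} - could not read usage. Manual intervention needed."
  exit 1
fi

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "[Auto-Remediation] Disk at ${USAGE}% on ${HOST}"

  # Step 1: Rotate logs
  ssh "$HOST" "sudo logrotate -f /etc/logrotate.conf"

  # Step 2: Clean old compressed logs (>7 days)
  ssh "$HOST" "sudo find /var/log -type f -name '*.gz' -mtime +7 -delete"

  # Step 3: Clean temp files (>3 days); this only helps if /tmp shares the /var/log volume
  ssh "$HOST" "find /tmp -mindepth 1 -type f -mtime +3 -delete 2>/dev/null"

  # Verify
  NEW_USAGE=$(disk_usage)

  if [ -n "$NEW_USAGE" ] && [ "$NEW_USAGE" -le "$THRESHOLD" ]; then
    echo "[Auto-Remediation] Resolved. ${USAGE}% -> ${NEW_USAGE}%"
    notify_slack "Disk full on ${HOST} auto-resolved (${USAGE}% -> ${NEW_USAGE}%)"
  else
    echo "[Auto-Remediation] Still at ${NEW_USAGE:-unknown}%. Escalating."
    page_oncall "Disk full on ${HOST} - auto-remediation failed. Manual intervention needed."
  fi
fi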
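
The usage check is the part of this script most worth testing in isolation, because an empty or non-numeric value (for example, when ssh fails) breaks the integer comparison. Below is a minimal sketch that runs locally without ssh. The helper names parse_pcent and needs_remediation are hypothetical, and GNU df is assumed for the --output option.

```shell
#!/bin/bash
# parse_pcent: hypothetical helper that reads `df --output=pcent` style output
# on stdin and prints the percentage from the last line as a bare integer.
# Returns non-zero when no digits are found (for example, when df or ssh failed).
parse_pcent() {
  local value
  value=$(tail -n 1 | tr -dc '0-9')
  [ -n "$value" ] || return 1
  printf '%s\n' "$value"
}

# needs_remediation USAGE THRESHOLD: succeeds when usage is strictly above threshold.
needs_remediation() {
  [ "$1" -gt "$2" ]
}

# Local dry run against the machine running this script (GNU df assumed).
if usage=$(df --output=pcent / 2>/dev/null | parse_pcent); then
  if needs_remediation "$usage" 90; then
    echo "would remediate: ${usage}%"
  else
    echo "ok: ${usage}%"
  fi
else
  echo "could not read disk usage" >&2
fi
```

Keeping the parsing and the threshold decision in small functions means both can be exercised with canned input before the script is ever allowed to delete files on a real host.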

The Results

| Incident Type | Before (MTTR) | After (MTTR) | Automation Level |
|---|---|---|---|
| Disk full | 25 min | 90 sec | Self-healing |
| Memory leak | 15 min | 45 sec | One-click |
| Cert expiry | 45 min | 0 (prevented) | Proactive |
| DB conn pool | 20 min | 60 sec | Self-healing |
| Stuck deploy | 30 min | 2 min | One-click |

Total monthly incident time: 14 hours → 45 minutes.
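
As a rough cross-check, the per-incident figures can be combined with the frequencies listed for the top five. This is only a sketch of the arithmetic: treating a month as exactly four weeks is an assumption, not something the post states. Because it covers only these five incident types, it comes out lower than the measured totals quoted above, which may include incidents outside the table.

```shell
#!/bin/bash
# Estimate monthly time spent on the five incident types, before and after.
# Assumption (not from the post): one month is treated as exactly 4 weeks.
before_min=0   # minutes per month with manual runbooks
after_sec=0    # seconds per month with automation

# add MONTHLY_COUNT BEFORE_MINUTES AFTER_SECONDS
add() {
  before_min=$(( before_min + $1 * $2 ))
  after_sec=$(( after_sec + $1 * $3 ))
}

add 8 25 90    # disk full: 2x/week
add 4 15 45    # memory leak: 1x/week
add 1 45 0     # cert expiry: 1x/month, prevented after automation
add 4 20 60    # DB connection pool: 1x/week
add 8 30 120   # stuck deploy: 2x/week

echo "top five before: ${before_min} min/month, after: $(( after_sec / 60 )) min/month"
# prints "top five before: 625 min/month, after: 35 min/month"
```

Running the same calculation with your own frequencies and MTTR values gives a quick estimate of what automating each category is worth.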

The Golden Rule

If you've fixed the same incident three times manually, it's time to automate. The third time pays for the automation effort. Everything after that is pure savings.

If you're tired of repetitive incident response and want to automate your runbooks with AI, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
