The On-Call Handoff That Prevents Dropped Incidents

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

The Monday Morning Disaster

Every Monday, the same story: the incoming on-call engineer has no idea what happened over the weekend. The outgoing engineer left a cryptic Slack message at 11pm and went to bed.

We lost 2 hours every Monday rebuilding context.

The Structured Handoff

We built a handoff template that takes 15 minutes to write and saves hours of confusion:

# On-Call Handoff: [DATE] → [DATE]
## Outgoing: @engineer_a | Incoming: @engineer_b

### Active Issues
| Issue | Status | Next Step | ETA |
|-------|--------|-----------|-----|
| DB replication lag | Monitoring | Auto-resolves if < 5s | Check at noon |
| Cert expiry api.prod | Fix scheduled | Deploy cert-bot PR #234 | Tuesday AM |

### Incidents This Shift
1. **[P2] Payment timeout spike** — 2024-03-15 02:30 UTC
   - Resolved: Increased connection pool from 20 to 50
   - Post-mortem: Scheduled for Wednesday
   - Lingering risk: Pool size is a band-aid, need connection pooler

### Upcoming Risks
- Major deploy of auth-service v3 on Tuesday
- Black Friday load test on Thursday
- AWS maintenance window Friday 2-6am UTC

### Helpful Context
- The cache-service has been flaky; a restart fixes it (known bug, JIRA-456)
- New on-call runbook for search-service is at [link]
- PagerDuty schedule was updated; check your shifts

### Metrics to Watch
- DB replication lag: should be < 1s (currently 0.8s)
- Payment success rate: should be > 99.8% (currently 99.7%)
- API error rate: baseline is 0.05% (currently 0.04%)
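The thresholds in "Metrics to Watch" lend themselves to an automated check. Here is a minimal sketch, assuming each metric has a single threshold and a direction; `MetricCheck` and `metrics_needing_attention` are hypothetical names for illustration, not part of our bot. The API error rate is left out because the template gives a baseline for it rather than a limit.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    name: str
    current: float
    threshold: float
    higher_is_better: bool  # True for success rates, False for lag or error rates

    def is_healthy(self) -> bool:
        """Compare the current value to the threshold in the right direction."""
        if self.higher_is_better:
            return self.current > self.threshold
        return self.current < self.threshold


def metrics_needing_attention(checks):
    """Return the names of metrics that are outside their expected range."""
    return [check.name for check in checks if not check.is_healthy()]


# Values copied from the example template.
checks = [
    MetricCheck("DB replication lag (s)", current=0.8, threshold=1.0, higher_is_better=False),
    MetricCheck("Payment success rate (%)", current=99.7, threshold=99.8, higher_is_better=True),
]
flagged = metrics_needing_attention(checks)
```

With the example values, only the payment success rate is flagged, since 99.7% sits below the 99.8% target.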

Automating the Handoff

We automated 80% of this with a bot:

def generate_handoff_report(outgoing_shift_start, outgoing_shift_end):
    """Collect everything the incoming engineer needs into one report.

    The get_* helpers and format_handoff are not shown in this post.
    """
    report = {
        'incidents': get_incidents(outgoing_shift_start, outgoing_shift_end),
        'active_alerts': get_active_alerts(),
        'recent_deploys': get_deploys(hours=48),
        'upcoming_maintenance': get_maintenance_windows(days=7),
        'slo_status': get_slo_status(),
        'open_tickets': get_oncall_tickets(status='open'),
    }

    # Auto-generate a short summary for the top of the report
    summary = []
    if report['incidents']:
        summary.append(f"{len(report['incidents'])} incident(s) during shift")
    if report['active_alerts']:
        summary.append(f"{len(report['active_alerts'])} active alert(s) to monitor")
    # Flag any SLO whose remaining error budget has dropped below 30
    if any(slo['budget_remaining'] < 30 for slo in report['slo_status']):
        summary.append("WARNING: SLO budget low for some services")

    return format_handoff(report, summary)
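The post does not show `format_handoff`, so here is one way it could look. This is a hypothetical sketch, not our production code: it assumes every value in `report` is a list of dicts and that each dict may carry a `title` key, falling back to the raw entry when it does not.

```python
def format_handoff(report, summary):
    """Render the report dict and summary lines as Markdown text.

    Hypothetical sketch: assumes each report value is a list of dicts.
    """
    lines = ["# On-Call Handoff", "", "## Summary"]
    lines.extend(f"- {item}" for item in (summary or ["Nothing to report"]))

    for section, entries in report.items():
        lines.append("")
        # 'active_alerts' becomes 'Active Alerts'
        lines.append(f"## {section.replace('_', ' ').title()}")
        if not entries:
            lines.append("- None")
            continue
        # Use the 'title' field when present, otherwise show the raw entry
        lines.extend(f"- {entry.get('title', str(entry))}" for entry in entries)

    return "\n".join(lines)
```

Empty sections are rendered as "None" instead of being dropped, so the incoming engineer can tell "nothing happened" apart from "nobody checked".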

The 15-Minute Handoff Call

The bot generates the report. The humans spend 15 minutes on video:

0-5 min:  Outgoing reviews active issues and incidents
5-10 min: Walk through upcoming risks and context
10-15 min: Incoming asks questions, confirms understanding

Critical rule: The outgoing engineer is NOT released until the incoming engineer says "I'm good."
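If the handoff is tracked in tooling, the rule can be enforced in code too. Below is a minimal sketch, assuming a simple in-memory record; the `Handoff` class and its method names are hypothetical and only illustrate the idea that ownership does not move until the incoming engineer confirms.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    outgoing: str
    incoming: str
    acknowledged: bool = False

    def acknowledge(self, engineer: str) -> None:
        """Only the incoming engineer can confirm the handoff."""
        if engineer != self.incoming:
            raise PermissionError(f"{engineer} is not the incoming on-call")
        self.acknowledged = True

    def current_owner(self) -> str:
        """The outgoing engineer stays responsible until acknowledgement."""
        return self.incoming if self.acknowledged else self.outgoing


handoff = Handoff(outgoing="engineer_a", incoming="engineer_b")
owner_before = handoff.current_owner()
handoff.acknowledge("engineer_b")  # the "I'm good" moment
owner_after = handoff.current_owner()
```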

The Handoff Score

We rate every handoff:

handoff_score:
  report_completed: +1
  call_happened: +1
  all_incidents_documented: +1
  active_issues_listed: +1
  upcoming_risks_noted: +1
  metrics_baseline_included: +1

  max_score: 6
  target: ">= 5"
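The scorecard is simple enough to compute automatically. Here is a minimal sketch, assuming each handoff is recorded as a dict of booleans keyed by the criteria names above; `handoff_score` and `meets_target` are hypothetical helper names.

```python
CRITERIA = (
    "report_completed",
    "call_happened",
    "all_incidents_documented",
    "active_issues_listed",
    "upcoming_risks_noted",
    "metrics_baseline_included",
)
TARGET = 5


def handoff_score(checklist):
    """Count one point per criterion marked True; missing keys score zero."""
    return sum(1 for name in CRITERIA if checklist.get(name, False))


def meets_target(checklist):
    """A handoff passes when it scores at least TARGET out of len(CRITERIA)."""
    return handoff_score(checklist) >= TARGET


# Example: everything was done except the handoff call.
checklist = {name: True for name in CRITERIA}
checklist["call_happened"] = False
score = handoff_score(checklist)
```

In this example the handoff scores 5 out of 6 and still meets the target.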

We track this weekly. Teams that consistently score 5 or higher have 60% fewer "lost context" incidents.

Results

| Metric | Before | After |
|--------|--------|-------|
| Monday morning incidents due to lost context | 3-4/month | 0-1/month |
| Time to rebuild context | 2 hours | 15 minutes |
| Incoming on-call confidence (1-5) | 2.3 | 4.6 |
| Escalations due to missing info | 8/month | 1/month |

The best part: engineers actually look forward to handoffs now because they're quick and useful instead of stressful.

If you want AI-generated on-call handoff reports that capture everything automatically, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
