The Incident Commander Role: Running Incidents Without Chaos

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

Everyone's Debugging, Nobody's Leading

Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting.

This is what incidents look like without an Incident Commander.

What the IC Does

The IC doesn't debug. They coordinate.

IC Responsibilities:
✓ Declare incident severity
✓ Assign roles (debugger, communicator, scribe)
✓ Coordinate investigation streams
✓ Make decisions (rollback? escalate? wait?)
✓ Manage communication (status page, stakeholders)
✓ Call for help when needed
✓ Declare all-clear

IC Does NOT:
✗ Write code
✗ Run queries
✗ SSH into servers
✗ Debug the issue

The IC Playbook

Minute 0-5: Declaration

1. Acknowledge the page
2. Open incident channel: #inc-YYYY-MM-DD-description
3. Post severity declaration:

"I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing

Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)

First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes."
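The declaration above follows a fixed shape, so it can be generated rather than typed under pressure. The following is a minimal sketch, not a definitive implementation: the `Declaration` class, `channel_name`, and `render_declaration` are hypothetical names introduced here for illustration, and the slug rule for the channel description is an assumption.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass
class Declaration:
    """Hypothetical container for the fields in the declaration post."""
    severity: str
    summary: str
    impact: str
    roles: Dict[str, str]          # handle -> role
    first_actions: Dict[str, str]  # handle -> first task
    update_minutes: int = 10


def channel_name(day: date, description: str) -> str:
    """Build a name in the #inc-YYYY-MM-DD-description pattern."""
    # Assumption: lowercase the description and collapse anything that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def render_declaration(d: Declaration) -> str:
    """Format the severity declaration in the same layout as the example."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {d.severity} - {d.summary}",
        f"Impact: {d.impact}",
        "",
        "Roles:",
    ]
    lines += [f"- @{who}: {role}" for who, role in d.roles.items()]
    lines += ["", "First actions:"]
    lines += [f"- @{who}: {task}" for who, task in d.first_actions.items()]
    lines.append(f"- I'll update every {d.update_minutes} minutes.")
    return "\n".join(lines)
```

Filling in a `Declaration` forces the IC to name a severity, an impact, and an owner for each role before anything is posted.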

Minute 5-15: Investigation

The IC runs a structured investigation loop:

Every 5 minutes:
1. "@alice, what have you found?"
2. Synthesize information
3. Decide next action
4. Assign next task
5. Update channel: "Current theory: [X]. Testing: [Y]."
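The loop above is mostly about cadence and a consistent update line. As a rough sketch, assuming a 5-minute interval, the check-in times and the channel update could be produced like this. The helper names `checkin_times` and `loop_update` are hypothetical, not part of any tool mentioned in this post.

```python
from datetime import datetime, timedelta


def checkin_times(start, interval_minutes=5, count=3):
    """Return the next `count` check-in times after `start`."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


def loop_update(theory, test):
    """Format the channel update posted at the end of each loop."""
    return f"Current theory: {theory}. Testing: {test}."
```

Calling `checkin_times` once at declaration gives the IC a list of times to set reminders for, so the cadence does not depend on anyone remembering to look at the clock.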

Minute 15+: Decision Points

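def ic_decision_tree(situation):
    """Return the IC's next action for the current state of the incident."""
    # Root cause known: ship the fix if there is one, otherwise roll back.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"

    # duration is in minutes (matches the "Minute 15+" heading above).
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    return "Continue investigation, update in 5 min"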

Communication Templates

Pre-written templates save precious minutes:

templates:
  internal_update:
    format: |
      **Incident Update [{severity}] {time} UTC**
      Status: {investigating|identified|monitoring|resolved}
      Impact: {impact_description}
      Current action: {what_we_are_doing}
      Next update: {time_of_next_update}

  status_page_update:
    format: |
      We are {status} an issue affecting {service}.
      Some users may experience {symptom}.
      Our team is actively working on a resolution.
      Next update in {minutes} minutes.

  executive_escalation:
    format: |
      P1 Incident: {title}
      Duration: {duration} minutes
      Customer impact: {impact}
      Revenue impact: ~${revenue}/hour
      Current status: {status}
      ETA to resolution: {eta}
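A template only saves time if filling it in is mechanical. Here is a minimal sketch of rendering the internal update with Python's built-in `str.format`. It assumes the status line takes a single `{status}` field restricted to the four values listed in the template; `INTERNAL_UPDATE`, `ALLOWED_STATUSES`, and `render_internal_update` are names made up for this example.

```python
# Internal update template, with the status choices moved into a validated field.
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)

ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}


def render_internal_update(**fields):
    """Fill in the template, rejecting statuses outside the allowed set."""
    if fields.get("status") not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    # str.format raises KeyError if any placeholder is left unfilled,
    # which keeps half-complete updates from being posted.
    return INTERNAL_UPDATE.format(**fields)
```

The same approach should work for the status page and executive templates. The `$` in `~${revenue}/hour` is a literal character to `str.format`, so it needs no escaping.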

Training New ICs

We use game days to train ICs:

Week 1: Shadow an experienced IC during a game day
Week 2: IC a simulated P2 incident (game day)
Week 3: IC a simulated P1 incident (game day)
Week 4: IC a real P3/P4 incident with a mentor observing
Week 5+: IC rotation for all severities

The IC Rotation

ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on-call compensation
    - IC counts as on-call time
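The requirements and weekly schedule above translate into a small eligibility check plus a round-robin. This is a sketch under stated assumptions rather than how any particular scheduler works: the `Engineer` class, the `ROTATION_EPOCH` start date, and the choice to raise an error when the pool drops below six are all illustrative.

```python
from dataclasses import dataclass
from datetime import date

MIN_POOL_SIZE = 6                  # from pool_size in the config above
ROTATION_EPOCH = date(2024, 1, 1)  # assumption: an arbitrary Monday to count weeks from


@dataclass
class Engineer:
    name: str
    completed_training: bool
    months_on_team: int
    incidents_shadowed: int


def is_eligible(e):
    """Apply the three requirements from the rotation config."""
    return (
        e.completed_training
        and e.months_on_team >= 6
        and e.incidents_shadowed >= 3
    )


def ic_for_week(pool, day):
    """Pick the IC for the week containing `day` by simple round-robin."""
    eligible = sorted(e.name for e in pool if is_eligible(e))
    if len(eligible) < MIN_POOL_SIZE:
        # Assumption: treat an undersized pool as an error, not a warning.
        raise ValueError(
            f"need at least {MIN_POOL_SIZE} eligible ICs, have {len(eligible)}"
        )
    weeks_elapsed = (day - ROTATION_EPOCH).days // 7
    return eligible[weeks_elapsed % len(eligible)]
```

Counting weeks from a fixed date, rather than using the calendar week number, avoids a skipped or repeated turn at the year boundary.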

Before and After

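| Metric                   | Without IC | With IC  |
|--------------------------|------------|----------|
| MTTR (P1)                | 67 min     | 28 min   |
| Communication gaps       | Frequent   | Rare     |
| Duplicate work           | ~40%       | ~5%      |
| Stakeholder satisfaction | Low        | High     |
| Post-mortem quality      | Incomplete | Thorough |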
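To put the two numeric rows in relative terms, here is a quick calculation using only the figures from the table. The `percent_reduction` helper is just for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


mttr_drop = percent_reduction(67, 28)       # P1 MTTR, minutes
duplicate_drop = percent_reduction(40, 5)   # share of duplicate work, percent

print(f"MTTR reduced by about {mttr_drop:.0f}%")
print(f"Duplicate work reduced by about {duplicate_drop:.1f}%")
```

That works out to roughly a 58% reduction in P1 MTTR and an 87.5% relative reduction in duplicate work (35 percentage points).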

The IC doesn't make incidents shorter because they're smarter. They make incidents shorter because someone is actually managing the response.

If you want AI-assisted incident coordination that makes every engineer an effective IC, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
