---
name: runbook-incident-response-writer
description: Write an incident response runbook so the team knows exactly what to do when something breaks. Use this skill whenever a user needs a runbook, an incident response plan, an on-call playbook, or a break-glass procedure, or says 'write a runbook for', 'what do we do when X breaks', or 'we need an incident plan'. Trigger whenever a failure scenario needs a calm, step-by-step response documented before it happens.
---

# Runbook and Incident Response Writer

## What this does and why it matters
When something breaks, panic and improvisation make it worse. A runbook is the calm, tested procedure that turns a crisis into a checklist. This skill writes a runbook for a specific failure scenario: how to detect it, who does what, the exact steps to diagnose and resolve, and how to communicate, so the team responds fast and consistently under pressure.

## Inputs to gather
1. The failure scenario or system this runbook covers.
2. How the failure is detected (alerts, symptoms, reports).
3. The steps to diagnose and resolve, as the expert knows them.
4. Who is involved, the escalation path, and who must be notified.

## Method

### 1. Define the trigger and severity
State what signals this incident and how severe it is. Severity determines who gets pulled in and how fast.
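As a minimal sketch of how severity can drive who gets pulled in and how fast, the mapping below ties each level to the roles paged and an acknowledgement deadline. The level names (`SEV1` to `SEV3`), the role names, and the minute values are illustrative assumptions, not part of this skill; substitute the team's own definitions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    page: tuple[str, ...]       # roles pulled in at this severity
    ack_within_minutes: int     # how fast someone must acknowledge


# Hypothetical severity levels and thresholds, for illustration only.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy(
        "Customer-facing outage or data loss",
        ("on-call lead", "engineering manager", "comms"),
        5,
    ),
    "SEV2": SeverityPolicy(
        "Degraded service, workaround exists",
        ("on-call lead", "comms"),
        15,
    ),
    "SEV3": SeverityPolicy(
        "Internal-only or minor impact",
        ("on-call lead",),
        60,
    ),
}


def who_to_page(severity: str) -> tuple[str, ...]:
    """Return the roles to page for a severity level, rejecting unknown levels."""
    try:
        return SEVERITY_POLICIES[severity].page
    except KeyError:
        known = sorted(SEVERITY_POLICIES)
        raise ValueError(f"Unknown severity {severity!r}; expected one of {known}") from None
```

Rejecting unknown levels outright, rather than defaulting to the lowest one, keeps a mistyped severity from quietly under-paging.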

### 2. Assign roles for the incident
Name who leads, who executes, and who communicates. The middle of an incident is the wrong time to decide roles.

### 3. Write diagnosis then resolution as ordered steps
Numbered, specific, and safe to follow under stress. Separate "figure out what is wrong" from "fix it", and mark any destructive or irreversible step for extra caution and confirmation.
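One way to keep destructive steps from blending into the rest is to carry the flag in the step data and let the renderer add the warning. The sketch below is an assumption about how a team might model steps; the `Step` class, the warning wording, and the sample steps are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    instruction: str
    destructive: bool = False   # True for irreversible or data-changing actions


# Hypothetical warning text; use whatever wording the team recognises on sight.
DESTRUCTIVE_PREFIX = "[DESTRUCTIVE: confirm with incident lead] "


def render_steps(steps: list[Step]) -> str:
    """Render steps as a numbered list, prefixing destructive ones with a warning."""
    lines = []
    for number, step in enumerate(steps, start=1):
        prefix = DESTRUCTIVE_PREFIX if step.destructive else ""
        lines.append(f"{number}. {prefix}{step.instruction}")
    return "\n".join(lines)


# Illustrative resolution steps for a failed nightly data sync.
resolution = [
    Step("Pause the sync scheduler."),
    Step("Truncate the staging table.", destructive=True),
    Step("Re-run the sync job."),
]
print(render_steps(resolution))
```

Because the flag lives on the step rather than in free text, a reviewer can also list every destructive step in a runbook with a single filter.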

### 4. Include verification
How to confirm the incident is actually resolved, not just quiet.
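A single passing check can mean the system is only momentarily quiet. A hedged sketch of a stricter rule is to require several consecutive healthy checks spaced over time before declaring the incident resolved. The function name, the default of three passes, and the 60-second interval are assumptions for illustration; `check` stands in for whatever health probe the runbook specifies.

```python
import time
from typing import Callable


def confirm_resolved(
    check: Callable[[], bool],
    required_passes: int = 3,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if `check` passes `required_passes` times in a row.

    Any failing check returns False immediately, so the responder goes back
    to diagnosis instead of closing the incident.
    """
    for attempt in range(required_passes):
        if not check():
            return False
        # Wait between checks, but not after the final one.
        if attempt < required_passes - 1:
            sleep(interval_seconds)
    return True
```

The `sleep` parameter is injectable so the rule can be exercised in tests without waiting in real time.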

### 5. Define communication
Who to notify, when, and what to say, internally and to customers if relevant. Silence during an incident erodes trust.

### 6. Add the follow-up
The post-incident review and what to capture, so the same fire does not recur.

## Output format
ALWAYS use:

# Runbook: [Scenario]
## Trigger and severity
## Roles (lead / executor / comms)
## Detection (alerts, symptoms)
## Diagnosis steps (numbered)
## Resolution steps (numbered, destructive steps flagged)
## Verification (confirm resolved)
## Communication plan (who, when, what)
## Post-incident review
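A drafted runbook can be checked against the section list above before it is handed over. The following is a minimal sketch, assuming the draft is Markdown and each required section appears as a `## ` heading that begins with the section name; `missing_sections` is a hypothetical helper, not part of this skill.

```python
from __future__ import annotations

# Section names taken from the output format above, without the parenthetical hints.
REQUIRED_SECTIONS = [
    "Trigger and severity",
    "Roles",
    "Detection",
    "Diagnosis steps",
    "Resolution steps",
    "Verification",
    "Communication plan",
    "Post-incident review",
]


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return the required sections that have no matching level-2 heading."""
    headings = [
        line[3:].strip().lower()
        for line in runbook_markdown.splitlines()
        if line.startswith("## ")
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section.lower()) for heading in headings)
    ]
```

An empty result means every required heading is present; it says nothing about whether the content under each heading is adequate.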

## Anti-patterns to avoid
- Vague steps that require improvisation mid-crisis.
- No role assignment, so everyone or no one acts.
- Destructive steps not clearly flagged.
- No verification, so the incident is declared over prematurely.
- No comms plan, leaving stakeholders in the dark.

## Example
A runbook for a failed nightly data sync defines the alert that triggers it, names the on-call lead, gives five diagnosis steps then three resolution steps with the re-run flagged as safe, verifies row counts, and notifies the affected team.
