Incident Checklist
Use this when something is broken.
Stabilize
- What is the user impact?
- Is data at risk?
- Can we disable the broken path?
- Can we roll back?
- Who needs to know?
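One way to make "disable the broken path" a quick option is a kill switch read at call time. The following is a minimal sketch, assuming flags are supplied through environment variables; `path_enabled`, `export_report`, and the `EXPORT_ENABLED` flag name are hypothetical and only illustrate the idea.

```python
import os

# Values treated as "switched off" (an assumption for this sketch).
OFF_VALUES = {"0", "false", "off", "no"}


def path_enabled(flag_name: str, default: bool = True) -> bool:
    """Return False when the flag's environment variable holds an 'off' value."""
    raw = os.environ.get(flag_name)
    if raw is None:
        return default
    return raw.strip().lower() not in OFF_VALUES


def export_report(rows):
    # EXPORT_ENABLED is a hypothetical flag guarding the code path under suspicion.
    if not path_enabled("EXPORT_ENABLED"):
        return {"status": "disabled", "rows": 0}
    return {"status": "ok", "rows": len(rows)}
```

Reading the flag on every call, rather than once at startup, means the path can be switched off without a deploy, as long as the process can see the changed value.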
Investigate
- What changed recently?
- Can we reproduce?
- Which boundary is failing?
- What logs prove the failure?
- What metrics changed?
- Is the database healthy?
- Are external services healthy?
- What is the smallest safe mitigation?
- What is the root cause fix?
- What test or check proves it?
- What should be monitored after?
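The "What changed recently?" question can be answered mechanically if changes are recorded with timestamps. This is a rough sketch, assuming each change is a dict with an `at` datetime and a `what` description; that record shape, the `changes_before` helper, and the 24-hour default window are all assumptions, not something this checklist prescribes.

```python
from datetime import datetime, timedelta


def changes_before(changes, incident_start, window=timedelta(hours=24)):
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c["at"] <= incident_start]
    return sorted(recent, key=lambda c: c["at"], reverse=True)


# Hypothetical change log for illustration only.
incident_start = datetime(2024, 5, 2, 14, 0)
change_log = [
    {"at": datetime(2024, 4, 28, 9, 0), "what": "dependency upgrade"},
    {"at": datetime(2024, 5, 2, 13, 40), "what": "config change"},
    {"at": datetime(2024, 5, 1, 18, 0), "what": "deploy"},
]
suspects = changes_before(change_log, incident_start)
```

The newest change comes first because the change closest to the start of the incident is usually the first one worth ruling out.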
Postmortem
- Timeline.
- Impact.
- Root cause.
- Detection gap.
- Fix.
- Prevention.
- Owner and deadline for follow-ups.
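The postmortem items above can be kept as a structured record so that gaps are easy to spot before the document is published. This is a minimal sketch; the `Postmortem` and `FollowUp` classes and the `gaps` helper are hypothetical names, and treating every section as free text is an assumption.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class FollowUp:
    action: str
    owner: str = ""
    deadline: Optional[date] = None


@dataclass
class Postmortem:
    timeline: str = ""
    impact: str = ""
    root_cause: str = ""
    detection_gap: str = ""
    fix: str = ""
    prevention: str = ""
    follow_ups: List[FollowUp] = field(default_factory=list)


def gaps(pm: Postmortem) -> List[str]:
    """List empty sections and follow-ups that lack an owner or a deadline."""
    problems = [
        f.name
        for f in fields(pm)
        if f.name != "follow_ups" and not getattr(pm, f.name).strip()
    ]
    for item in pm.follow_ups:
        if not item.owner or item.deadline is None:
            problems.append(f"follow-up without owner or deadline: {item.action}")
    return problems
```

An empty result from `gaps` only shows that every section was filled in and every follow-up has an owner and a deadline; it says nothing about whether the content is accurate.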