Operations

Incident Postmortem

Constructs a structured incident postmortem by analyzing git history, code changes, and error patterns — with timeline, root cause analysis, and action items.

SKILL.md

---
description: Generate a structured incident postmortem from git history and code analysis
allowed-tools: Bash(git *), Read, Glob, Grep, Write
---

# Incident Postmortem Generator

Analyze recent git history, code changes, and error patterns to construct a structured postmortem for an incident.

What happened: $ARGUMENTS

## Steps

1. **Establish the incident timeline** from git history:
   - Run `git log --since="7 days ago" --pretty=format:"%h|%s|%an|%ad" --date=iso` to see recent commits.
   - Run `git log --all --oneline --graph --since="7 days ago"` to visualize branch activity.
   - Identify any recent deployments, merges to main, or hotfix branches.
   - Look for revert commits (`git log --grep="revert" --since="7 days ago"`).

2. **Identify the likely change that caused the incident**:
   - Based on the $ARGUMENTS description, grep the codebase for related code.
   - Run `git log -p --since="7 days ago" -- [relevant files]` to see recent changes to those files.
   - Look for changes to configuration files, environment variables, database schemas, or API contracts.
   - Check for dependency updates (`git diff HEAD~20..HEAD -- *lock* package.json requirements.txt Cargo.toml`).

3. **Analyze the root cause**:
   - Read the files involved in the suspected change.
   - Grep for error handling patterns (or lack thereof) in the affected code.
   - Check for missing validation, unhandled edge cases, or race conditions.
   - Look for related tests — were there tests that should have caught this?
   - Check CI/CD configuration for gaps in the test/deploy pipeline.

4. **Assess the impact**:
   - Grep for monitoring, alerting, or logging related to the affected area.
   - Identify which user-facing features were affected based on the code path.
   - Estimate the blast radius: which routes, APIs, or services depend on the affected code.
   - Check for feature flags that could have limited the impact.

5. **Document the resolution**:
   - Find the fix commit(s) if a fix has been applied.
   - Read the fix diff to understand what was changed.
   - Assess whether the fix is a permanent solution or a temporary mitigation.

6. **Generate the postmortem document**:

   ```markdown
   # Incident Postmortem: [Title]

   **Date**: [Date of incident]
   **Severity**: [Critical / Major / Minor]
   **Duration**: [Estimated from git timestamps]
   **Author**: [Generated by AI]

   ## Summary
   [2-3 sentence description of what happened and the impact]

   ## Timeline
   | Time | Event |
   |------|-------|
   | [timestamp] | [Change deployed / merged] |
   | [timestamp] | [First signs of impact] |
   | [timestamp] | [Incident detected] |
   | [timestamp] | [Investigation began] |
   | [timestamp] | [Root cause identified] |
   | [timestamp] | [Fix deployed] |

   ## Root Cause
   [Detailed technical explanation of why the incident occurred]

   ### Contributing Factors
   - [Factor 1: e.g., missing test coverage for edge case]
   - [Factor 2: e.g., no canary deployment process]
   - [Factor 3: e.g., insufficient monitoring on the affected endpoint]

   ## Impact
   - **Users affected**: [Estimate based on code analysis]
   - **Features affected**: [List from code path analysis]
   - **Data impact**: [Any data corruption or loss, based on code analysis]

   ## Resolution
   [Description of the fix, with reference to the commit]

   ## Lessons Learned
   ### What went well
   - [e.g., Fast detection time, clean rollback]

   ### What went poorly
   - [e.g., No alerting on the affected metric]

   ## Action Items
   | Action | Owner | Priority | Status |
   |--------|-------|----------|--------|
   | [Add tests for the edge case that caused the incident] | [TBD] | High | Open |
   | [Add monitoring/alerting for affected endpoint] | [TBD] | High | Open |
   | [Add integration test to CI pipeline] | [TBD] | Medium | Open |
   | [Document the runbook for this failure mode] | [TBD] | Medium | Open |
   | [Review similar code paths for the same pattern] | [TBD] | Low | Open |
   ```

7. **Write the postmortem** to `POSTMORTEM-[date].md` in the project root.

## Rules

- Blameless by default. Never attribute the incident to an individual — focus on systems and processes.
- Base the timeline on git timestamps and evidence, not guesswork.
- Distinguish between confirmed facts and inferences. Mark inferences with "[Inferred]".
- Action items must be specific and actionable — not vague ("improve testing").
- If the root cause cannot be definitively determined from the code, list the most likely candidates ranked by probability.
- Include action items that address the systemic issue, not just the proximate cause.
- The postmortem should be useful to someone who was not involved in the incident.

How It Works

Incident postmortems are critical for organizational learning, but they are painful to write. The person who fixed the incident at 2 AM is rarely eager to spend the next day documenting what happened. This skill automates the tedious parts — constructing the timeline from git history, identifying the change that caused the incident, and analyzing the code for root causes.

The blameless framing is built into the template. By focusing on systems, processes, and code patterns rather than individuals, the generated postmortem encourages honest analysis. The action items target systemic improvements: adding tests, improving monitoring, documenting runbooks — not "tell Bob to be more careful."

The git history analysis is particularly powerful. By correlating deployment timestamps with the incident description, the skill can often pinpoint the exact commit that introduced the problem. Reading the diff of that commit reveals the root cause, and examining the surrounding test files reveals why it was not caught. This forensic approach produces postmortems that are both thorough and evidence-based.

←Back to Examples