AI Post-Mortem Template
A blameless post-mortem template designed for AI system incidents. Covers incident timeline, root cause analysis, impact assessment, corrective actions, and lessons learned. Helps teams learn from AI failures and build more resilient systems.
Overview
What's included
Incident Summary
AI Incident Post-Mortem
Incident ID:
Incident name:
Date of incident:
Post-mortem date:
Post-mortem facilitator:
Post-mortem attendees:
Severity
- Critical — Service outage or major data/safety incident affecting customers
- High — Significant quality degradation or security concern
- Medium — Noticeable impact on a subset of users
- Low — Minor issue caught before significant user impact
Summary
In 2-3 sentences, what happened?
Key Metrics
| Metric | Value |
|---|---|
| Time to detect | minutes/hours |
| Time to acknowledge | minutes/hours |
| Time to mitigate | minutes/hours |
| Time to resolve | minutes/hours |
| Total duration | minutes/hours |
| Users/requests affected | |
| Revenue impact | £ |
| SLA breached? | Yes / No |
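The durations in the table above can be derived from the timeline timestamps rather than estimated from memory. The sketch below is one way to do that in Python. The dictionary keys, the example timestamps, and the choice of reference point for each metric are assumptions for illustration: here, time to detect and total duration are measured from first occurrence, and the other three are measured from detection. Adjust these to match your own definitions.

```python
from datetime import datetime, timedelta


def incident_metrics(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Derive key durations from timeline timestamps (all in UTC).

    Assumed definitions (not prescribed by the template):
    - time to detect and total duration start at first occurrence
    - acknowledge, mitigate, and resolve are measured from detection
    """
    start = timeline["first_occurrence"]
    detected = timeline["detected"]
    return {
        "time_to_detect": detected - start,
        "time_to_acknowledge": timeline["acknowledged"] - detected,
        "time_to_mitigate": timeline["mitigated"] - detected,
        "time_to_resolve": timeline["resolved"] - detected,
        "total_duration": timeline["resolved"] - start,
    }


# Hypothetical timestamps, for illustration only.
example_timeline = {
    "first_occurrence": datetime(2024, 1, 15, 9, 0),
    "detected": datetime(2024, 1, 15, 9, 12),
    "acknowledged": datetime(2024, 1, 15, 9, 17),
    "mitigated": datetime(2024, 1, 15, 9, 50),
    "resolved": datetime(2024, 1, 15, 10, 40),
}

metrics = incident_metrics(example_timeline)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Computing the values from recorded timestamps keeps the metrics table consistent with the timeline and makes any gap in the timeline obvious, since a missing key raises an error.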
Timeline & Root Cause Analysis
Timeline of Events
| Time (UTC) | Event | Actor | Notes |
|---|---|---|---|
| | First occurrence of the issue | System | |
| | Alert triggered / issue detected | Monitoring / User report | |
| | Incident acknowledged by on-call | | |
| | Initial diagnosis | | |
| | Mitigation applied | | |
| | Service restored / issue resolved | | |
| | Root cause confirmed | | |
| | Post-mortem scheduled | | |
Root Cause Analysis (5 Whys)
Problem:
- Why?
- Why?
- Why?
- Why?
- Why?
Root cause:
Contributing Factors
| Factor | Category | Description |
|---|---|---|
| | Technical / Process / Human | |
| | Technical / Process / Human | |
| | Technical / Process / Human | |
AI-Specific Root Causes (Check if applicable)
- Model drift — model performance degraded over time
- Data quality — input data quality changed
- Prompt change — a prompt modification caused unexpected behaviour
- Model update — vendor updated the underlying model
- Adversarial input — malicious or unexpected user input
- Hallucination — model generated false or misleading output
- Rate limiting — AI service rate limits caused failures
- Context window — input exceeded model context limits
- Integration failure — API or integration issue
- Infrastructure — compute, network, or storage failure
Corrective Actions & Lessons Learned
Impact Assessment
| Dimension | Impact | Detail |
|---|---|---|
| Customer impact | High/Medium/Low/None | |
| Revenue impact | £ | |
| Reputational impact | High/Medium/Low/None | |
| Data/security impact | High/Medium/Low/None | |
| Compliance impact | High/Medium/Low/None | |
Corrective Actions
Immediate (Within 1 Week)
| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 1 | | | | Open |
| 2 | | | | Open |
Short-Term (Within 1 Month)
| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 3 | | | | Open |
| 4 | | | | Open |
Long-Term (Within 1 Quarter)
| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 5 | | | | Open |
| 6 | | | | Open |
Lessons Learned
What went well?
What could be improved?
What will we do differently next time?
Blameless Culture Reminder
This post-mortem is conducted in a blameless manner. We focus on systemic improvements, not individual blame. The goal is to make our systems and processes more resilient, not to assign fault.
Instructions
How to use this template
Conduct the post-mortem within 48 hours
Hold the post-mortem while details are fresh. Invite everyone who was involved in detecting, diagnosing, and resolving the incident.
Build the timeline first
Reconstruct what happened chronologically before analysing why. Use logs, alerts, and team recollections to build an accurate timeline.
Use 5 Whys to find root causes
Keep asking 'why' until you reach a systemic root cause rather than stopping at human error. The root cause should point to a process or system improvement.
Assign concrete corrective actions
Every action needs an owner and deadline. Track actions to completion in subsequent weeks. Unfinished actions from post-mortems erode trust in the process.
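The rule that every action needs an owner and a deadline can be checked mechanically before each weekly review. The following is a minimal sketch, assuming actions are exported from your tracker into simple records. The `Action` class, its field names, and the sample data are hypothetical and do not correspond to any particular tool's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    """One row of a corrective actions table (hypothetical structure)."""
    number: int
    description: str
    owner: str
    deadline: Optional[date]
    status: str = "Open"


def actions_needing_attention(actions: list[Action], today: date) -> list[tuple[int, str]]:
    """Return (action number, reason) for open actions that are incomplete or late."""
    flagged = []
    for action in actions:
        if action.status.lower() == "done":
            continue  # completed actions need no follow-up
        if not action.owner or action.deadline is None:
            flagged.append((action.number, "missing owner or deadline"))
        elif action.deadline < today:
            flagged.append((action.number, "overdue"))
    return flagged


# Hypothetical sample data, for illustration only.
sample_actions = [
    Action(1, "Add alert on output quality score", "on-call lead", date(2024, 1, 22)),
    Action(2, "Pin model version in config", "", None),
    Action(3, "Document rollback steps", "platform team", date(2024, 1, 20), "Done"),
    Action(4, "Add prompt regression tests", "ML team", date(2024, 3, 1)),
]

flagged = actions_needing_attention(sample_actions, today=date(2024, 2, 1))
for number, reason in flagged:
    print(f"Action {number}: {reason}")
```

A check like this only surfaces gaps; the follow-up conversation about why an action stalled still belongs in the team review.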
Watch Out
Common mistakes to avoid
FAQ
Frequently asked questions
Conduct a post-mortem for all Critical and High severity incidents. Medium severity incidents should have a lightweight review. Low severity incidents can be tracked in a log without a full post-mortem.
Include everyone involved in the incident: the on-call engineer, the person who detected the issue, anyone who contributed to resolution, and the system owner. For high-severity incidents, include a leadership observer.
A blameless post-mortem focuses on systems and processes rather than individual actions. It assumes people made reasonable decisions based on the information they had at the time. The language should be 'what failed' not 'who failed'.
Add corrective actions as tickets in your project management tool (Jira, Linear, etc.) with clear owners and deadlines. Review progress in weekly team meetings. Report completion status in monthly governance reviews.
For significant customer-facing incidents, publish a summary (what happened, what we did, how we are preventing recurrence) as part of your incident communication. Full internal post-mortems contain too much detail for external sharing.
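The severity rule stated earlier in this FAQ (a full post-mortem for Critical and High, a lightweight review for Medium, a log entry for Low) can be encoded as a small lookup so that incident tooling applies it consistently. This is an illustrative sketch; the names `REVIEW_POLICY` and `review_type` are assumptions, not part of the template.

```python
# Mapping from severity level to the review depth described in the FAQ.
REVIEW_POLICY = {
    "critical": "full post-mortem",
    "high": "full post-mortem",
    "medium": "lightweight review",
    "low": "log entry",
}


def review_type(severity: str) -> str:
    """Return the review depth for a severity label (case-insensitive)."""
    key = severity.strip().lower()
    if key not in REVIEW_POLICY:
        raise ValueError(f"Unknown severity: {severity!r}")
    return REVIEW_POLICY[key]


print(review_type("High"))
print(review_type("Medium"))
```

Raising an error on an unknown label, instead of defaulting to the lightest review, avoids silently under-reviewing an incident whose severity was mistyped.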