GroveAI

AI Post-Mortem Template

A blameless post-mortem template designed for AI system incidents. Covers incident timeline, root cause analysis, impact assessment, corrective actions, and lessons learned. Helps teams learn from AI failures and build more resilient systems.

Overview

What's included

Incident summary and severity classification
Detailed timeline of events
Root cause analysis framework (5 Whys)
Impact assessment across multiple dimensions
Corrective actions with owners and deadlines
Lessons learned and knowledge sharing template

1. Incident Summary

AI Incident Post-Mortem

Incident ID:
Incident name:
Date of incident:
Post-mortem date:
Post-mortem facilitator:
Post-mortem attendees:

Severity

  • Critical — Service outage or major data/safety incident affecting customers
  • High — Significant quality degradation or security concern
  • Medium — Noticeable impact on a subset of users
  • Low — Minor issue caught before significant user impact

Summary

In 2-3 sentences, what happened?



Key Metrics

| Metric | Value |
|---|---|
| Time to detect | minutes/hours |
| Time to acknowledge | minutes/hours |
| Time to mitigate | minutes/hours |
| Time to resolve | minutes/hours |
| Total duration | minutes/hours |
| Users/requests affected | |
| Revenue impact | £ |
| SLA breached? | Yes / No |
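
The durations in the Key Metrics table can be derived from the timeline timestamps instead of being estimated by hand. The following is a minimal Python sketch, not part of the template itself: the `IncidentTimes` fields and the `key_metrics` function are hypothetical names, the metric definitions are assumptions (teams measure these intervals differently), and the timestamps are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimes:
    """Key timestamps (UTC) copied from the incident timeline (hypothetical field names)."""
    first_occurrence: datetime
    detected: datetime
    acknowledged: datetime
    mitigated: datetime
    resolved: datetime


def key_metrics(t: IncidentTimes) -> dict[str, timedelta]:
    """Derive the Key Metrics durations.

    Assumed definitions (adjust to your own conventions):
    - detect: first occurrence -> detection
    - acknowledge: detection -> acknowledgement
    - mitigate / resolve: detection -> mitigation / resolution
    - total duration: first occurrence -> resolution
    """
    return {
        "time_to_detect": t.detected - t.first_occurrence,
        "time_to_acknowledge": t.acknowledged - t.detected,
        "time_to_mitigate": t.mitigated - t.detected,
        "time_to_resolve": t.resolved - t.detected,
        "total_duration": t.resolved - t.first_occurrence,
    }


# Illustrative timestamps only, not taken from a real incident.
example = IncidentTimes(
    first_occurrence=datetime(2024, 1, 1, 9, 0),
    detected=datetime(2024, 1, 1, 9, 20),
    acknowledged=datetime(2024, 1, 1, 9, 25),
    mitigated=datetime(2024, 1, 1, 10, 0),
    resolved=datetime(2024, 1, 1, 11, 0),
)

for name, duration in key_metrics(example).items():
    print(f"{name}: {duration}")
```

Whichever definitions you choose, state them in the post-mortem so that metrics stay comparable across incidents.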

2. Timeline & Root Cause Analysis

Timeline of Events

| Time (UTC) | Event | Actor | Notes |
|---|---|---|---|
| | First occurrence of the issue | System | |
| | Alert triggered / issue detected | Monitoring / User report | |
| | Incident acknowledged by on-call | | |
| | Initial diagnosis | | |
| | Mitigation applied | | |
| | Service restored / issue resolved | | |
| | Root cause confirmed | | |
| | Post-mortem scheduled | | |

Root Cause Analysis (5 Whys)

Problem:  

  1. Why?  
  2. Why?  
  3. Why?  
  4. Why?  
  5. Why?  

Root cause:  

Contributing Factors

| Factor | Category | Description |
|---|---|---|
| | Technical / Process / Human | |
| | Technical / Process / Human | |
| | Technical / Process / Human | |

AI-Specific Root Causes (Check if applicable)

  • Model drift — model performance degraded over time
  • Data quality — input data quality changed
  • Prompt change — a prompt modification caused unexpected behaviour
  • Model update — vendor updated the underlying model
  • Adversarial input — malicious or unexpected user input
  • Hallucination — model generated false or misleading output
  • Rate limiting — AI service rate limits caused failures
  • Context window — input exceeded model context limits
  • Integration failure — API or integration issue
  • Infrastructure — compute, network, or storage failure
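
If the checked categories are recorded in a structured form, they can be tallied across post-mortems to show which AI-specific causes recur. The following Python sketch illustrates one way to do this. The enum mirrors the checklist above; the `tally_root_causes` function and the incident IDs are hypothetical and used only for illustration.

```python
from collections import Counter
from enum import Enum


class AIRootCause(Enum):
    """The AI-specific root cause categories from the checklist."""
    MODEL_DRIFT = "Model drift"
    DATA_QUALITY = "Data quality"
    PROMPT_CHANGE = "Prompt change"
    MODEL_UPDATE = "Model update"
    ADVERSARIAL_INPUT = "Adversarial input"
    HALLUCINATION = "Hallucination"
    RATE_LIMITING = "Rate limiting"
    CONTEXT_WINDOW = "Context window"
    INTEGRATION_FAILURE = "Integration failure"
    INFRASTRUCTURE = "Infrastructure"


def tally_root_causes(
    incidents: dict[str, list[AIRootCause]],
) -> list[tuple[AIRootCause, int]]:
    """Count how often each category was checked, most frequent first."""
    counts = Counter(cause for causes in incidents.values() for cause in causes)
    return counts.most_common()


# Illustrative data: incident ID -> categories checked in its post-mortem.
checked = {
    "INC-001": [AIRootCause.PROMPT_CHANGE, AIRootCause.HALLUCINATION],
    "INC-002": [AIRootCause.PROMPT_CHANGE],
    "INC-003": [AIRootCause.RATE_LIMITING],
}

for cause, count in tally_root_causes(checked):
    print(f"{cause.value}: {count}")
```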

3. Corrective Actions & Lessons Learned

Impact Assessment

| Dimension | Impact | Detail |
|---|---|---|
| Customer impact | High/Medium/Low/None | |
| Revenue impact | £ | |
| Reputational impact | High/Medium/Low/None | |
| Data/security impact | High/Medium/Low/None | |
| Compliance impact | High/Medium/Low/None | |

Corrective Actions

Immediate (Within 1 Week)

| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 1 | | | | Open |
| 2 | | | | Open |

Short-Term (Within 1 Month)

| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 3 | | | | Open |
| 4 | | | | Open |

Long-Term (Within 1 Quarter)

| # | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| 5 | | | | Open |
| 6 | | | | Open |

Lessons Learned

What went well?




What could be improved?




What will we do differently next time?




Blameless Culture Reminder

This post-mortem is conducted in a blameless manner. We focus on systemic improvements, not individual blame. The goal is to make our systems and processes more resilient, not to assign fault.

Instructions

How to use this template

1. Conduct the post-mortem within 48 hours

Hold the post-mortem while details are fresh. Invite everyone who was involved in detecting, diagnosing, and resolving the incident.

2. Build the timeline first

Reconstruct what happened chronologically before analysing why. Use logs, alerts, and team recollections to build an accurate timeline.
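
When events come from several sources, merging them programmatically helps avoid ordering mistakes. The following is a minimal Python sketch, assuming every source has already been converted to timezone-aware UTC timestamps. `TimelineEvent` and `build_timeline` are hypothetical names whose fields mirror the timeline table's columns, and the events are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    time: datetime  # timezone-aware, in UTC
    event: str
    actor: str
    notes: str = ""


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.time)


# Illustrative events only.
from_alerts = [
    TimelineEvent(datetime(2024, 1, 1, 9, 20, tzinfo=timezone.utc),
                  "Alert triggered", "Monitoring"),
]
from_logs = [
    TimelineEvent(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
                  "First occurrence of the issue", "System"),
    TimelineEvent(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
                  "Mitigation applied", "On-call"),
]

for e in build_timeline(from_alerts, from_logs):
    print(f"{e.time:%H:%M} | {e.event} | {e.actor} | {e.notes}")
```

Team recollections rarely carry exact timestamps, so it can help to record in the notes column that a time is approximate.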

3. Use 5 Whys to find root causes

Keep asking 'why' until you reach a systemic root cause, not a human error. The root cause should suggest a process or system improvement.

4. Assign concrete corrective actions

Every action needs an owner and deadline. Track actions to completion in subsequent weeks. Unfinished actions from post-mortems erode trust in the process.
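
Tracking can be as simple as a list of records that is checked for overdue items before each review. The following is a minimal Python sketch. `CorrectiveAction` and `overdue` are hypothetical names whose fields mirror the corrective action tables; the "Done" status value is an assumption (the template only shows "Open"), and the data is illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    number: int
    action: str
    owner: str
    deadline: date
    status: str = "Open"  # default status used in the template


def overdue(actions: list[CorrectiveAction], today: date) -> list[CorrectiveAction]:
    """Return actions that are not done and whose deadline has passed."""
    return [a for a in actions if a.status != "Done" and a.deadline < today]


# Illustrative data only.
actions = [
    CorrectiveAction(1, "Add alert on error rate", "Owner A", date(2024, 1, 8)),
    CorrectiveAction(2, "Pin model version", "Owner B", date(2024, 1, 8), status="Done"),
    CorrectiveAction(3, "Add regression tests for prompts", "Owner C", date(2024, 2, 1)),
]

for a in overdue(actions, today=date(2024, 1, 15)):
    print(f"#{a.number} {a.action} (owner: {a.owner}, due {a.deadline})")
```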

Watch Out

Common mistakes to avoid

Blaming individuals instead of systems — focus on what process or safeguard failed, not who made the mistake.
Not following up on corrective actions — track actions to completion; unfinished actions mean the incident can recur.
Only conducting post-mortems for major incidents — medium-severity incidents often reveal systemic issues worth addressing.
Not sharing learnings widely — publish post-mortems (with appropriate redaction) so the whole engineering team benefits.

FAQ

Frequently asked questions

When should we conduct a post-mortem?

Conduct a post-mortem for all Critical and High severity incidents. Medium severity incidents should have a lightweight review. Low severity incidents can be tracked in a log without a full post-mortem.
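
The severity policy above can be encoded so that incident tooling applies it consistently. The following is a minimal Python sketch; the `Severity` enum mirrors the template's four severity levels, while the `review_type` function and its return strings are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def review_type(severity: Severity) -> str:
    """Map an incident's severity to the kind of review described above."""
    if severity in (Severity.CRITICAL, Severity.HIGH):
        return "full post-mortem"
    if severity is Severity.MEDIUM:
        return "lightweight review"
    return "log entry only"


for level in Severity:
    print(f"{level.value}: {review_type(level)}")
```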

Who should attend the post-mortem?

Include everyone involved in the incident: the on-call engineer, the person who detected the issue, anyone who contributed to resolution, and the system owner. For high-severity incidents, include a leadership observer.

What is a blameless post-mortem?

A blameless post-mortem focuses on systems and processes rather than individual actions. It assumes people made reasonable decisions based on the information they had at the time. The language should be 'what failed', not 'who failed'.

How should we track corrective actions?

Add corrective actions as tickets in your project management tool (Jira, Linear, etc.) with clear owners and deadlines. Review progress in weekly team meetings. Report completion status in monthly governance reviews.

Should we share post-mortems externally?

For significant customer-facing incidents, publish a summary (what happened, what we did, how we are preventing recurrence) as part of your incident communication. Full internal post-mortems contain too much detail for external sharing.
