SRE · Incident Response · Postmortem · DevOps · Reliability

Blameless Postmortem: The Google SRE Template (With Examples)

How to write a blameless postmortem after an incident — Google SRE format, real examples, root cause analysis, action items. Free template + AI tool.

Nitish Yadav · May 16, 2026

Every engineering team has a moment where production breaks. The interesting question is what happens in the 48 hours after. Teams that write good postmortems get faster and more reliable. Teams that point fingers get the same incident again in three months with different names.

This guide explains how to write a blameless postmortem, the Google SRE template that originated the format, real examples for the sections people get wrong (root cause, action items), and how to turn raw incident notes into a polished document fast.

What is a blameless postmortem?

A blameless postmortem is an incident report that focuses on understanding what failed in the system — code, processes, monitoring, communication — without identifying or blaming individuals. The format was formalized by Google's SRE practice in the public SRE Book and has been adopted across the industry (Netflix, Stripe, Cloudflare, GitHub) for a simple reason: teams that name systemic causes fix them and ship faster, while teams that name individuals make engineers risk-averse, and risk-averse engineers hide problems.

Practically, "blameless" doesn't mean "no accountability." It means the report describes systems, not people:

  • ❌ "Adi pushed a migration that locked the users table"
  • ✅ "The migration system allowed a long-running ALTER TABLE without lock timeout, blocking writes for 38 minutes"

The first version teaches the next on-call engineer to be more careful. The second tells the team to add a lock timeout to the migration runner — which actually prevents the next incident.

What does a Google SRE postmortem look like?

The original Google SRE postmortem template has eight standard sections. Every section serves a specific purpose; skipping sections is how postmortems become hand-waving instead of learning.

1. Summary

One paragraph. What happened, how long, how bad. Written for an executive who's never going to read past this paragraph.

2. Impact

Numbers. Who was affected, how many, for how long, what severity. The discipline of writing specific impact numbers ("38% of API requests returned 5xx for 38 minutes, affecting ~12,000 users") forces you to actually measure — and surfaces gaps in your monitoring when you can't.
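If your request logs are structured, these numbers fall out of a few lines of analysis. A minimal sketch in Python; the log shape (dicts with "status" and "user_id" keys) is an assumption, not a prescription:

# Minimal sketch: derive impact numbers from structured request logs.
# Assumes each record is a dict with "status" and "user_id" keys.
def impact(window: list[dict]) -> dict:
    errors = [r for r in window if 500 <= r["status"] < 600]
    return {
        "error_rate": len(errors) / len(window),          # e.g. 0.38
        "affected_users": len({r["user_id"] for r in errors
                               if r.get("user_id")}),     # e.g. ~12,000
    }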

3. Timeline

Action-by-action with timestamps. Always UTC. Add a relative T+Nm notation alongside absolute times so readers can quickly see "this dragged on for an hour".

Example:

14:32 UTC (T+0)   pgbouncer starts returning "too many connections"
14:33 UTC (T+1m)  PagerDuty fires; on-call ack
14:38 UTC (T+6m)  on-call confirms database connection saturation
14:45 UTC (T+13m) discovered migration script holding open transaction
14:51 UTC (T+19m) migration killed
14:53 UTC (T+21m) pgbouncer restarted
15:10 UTC (T+38m) error rate returns to baseline
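The T+N column is trivial to generate once you have absolute times. A throwaway Python helper, using the timestamps from the example above:

from datetime import datetime

# Throwaway helper: minutes elapsed since the incident's first event.
def t_plus(start: str, event: str) -> str:
    fmt = "%H:%M"
    minutes = int((datetime.strptime(event, fmt)
                   - datetime.strptime(start, fmt)).total_seconds() // 60)
    return "T+0" if minutes == 0 else f"T+{minutes}m"

print(t_plus("14:32", "15:10"))  # "T+38m"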

4. Root cause

Use the "5 whys" technique to dig past the proximate cause to the systemic cause.

Why did the API return 5xx?
→ pgbouncer ran out of connections.

Why did pgbouncer run out of connections?
→ A migration held an exclusive transaction for 12 minutes.

Why did the migration hold the transaction that long?
→ The migration ran an ALTER TABLE on a 50M-row table without a lock timeout.

Why did it run without a lock timeout?
→ Our migration template doesn't include lock_timeout / statement_timeout.

Why doesn't our migration template include them?
→ We never wrote a runbook for risky migrations, even after the Q1 incident exposed the gap.

The fifth "why" is the systemic root cause. That's where action items aim.
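Here's what fixing that fifth "why" might look like: a migration wrapper that sets the timeouts before running anything. This is a hedged sketch assuming Postgres and psycopg2; the function and variable names are illustrative, not from a real migration runner:

import psycopg2

# Guards applied to every migration session. A blocked lock now fails fast
# (and can be retried off-peak) instead of stalling all writes behind it.
MIGRATION_GUARDS = """
SET lock_timeout = '5s';
SET statement_timeout = '30s';
"""

def run_migration(dsn: str, migration_sql: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            cur.execute(MIGRATION_GUARDS)
            cur.execute(migration_sql)
    finally:
        conn.close()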

5. What went well

Skip this and your postmortem reads like a hit piece. Naming what worked — fast detection, clear comms, good runbook — reinforces those behaviors next time.

6. What went poorly

Specific failure modes in the response: late detection, missing runbook, unclear ownership, broken alerts, communication gaps. Be concrete.

7. Where we got lucky

The most uncomfortable section, and the most valuable. "We got lucky that the incident happened at 6pm UTC when traffic was low — at 1pm UTC the impact would have been 4x." This surfaces hidden risks that would have made things worse.

8. Action items

The output of the whole exercise. Each item needs:

  • A specific deliverable ("Add lock_timeout = '5s' to the migration runner")
  • A named owner ("[owner: @adi]")
  • A due date ("by 2026-06-01")
  • A priority (P0 / P1 / P2)

Vague action items are useless. "Improve migration safety" is not an action item — it's a wish. "Add lock_timeout = '5s' to the migration runner by 2026-06-01, owned by @adi" is an action item.
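If you want to enforce those four fields, make them a data structure rather than a prose convention. A sketch (the class and field names are illustrative):

from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    deliverable: str  # specific and verifiable, not "improve X"
    owner: str        # a named person, e.g. "@adi"
    due: date
    priority: str     # "P0" | "P1" | "P2"

# The example from above, as a record where every field must be filled in:
item = ActionItem(
    deliverable="Add lock_timeout = '5s' to the migration runner",
    owner="@adi",
    due=date(2026, 6, 1),
    priority="P0",
)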

A real example (anonymized)

Here's an abbreviated blameless postmortem in the standard format:

Incident: API 5xx spike on 2026-05-14

Summary: A database migration on the production users table caused pgbouncer connection saturation, returning 5xx responses for ~38% of API traffic over 38 minutes (14:32–15:10 UTC).

Impact:

  • 38% of API requests returned 5xx
  • ~12,000 affected users (based on unique authenticated requests)
  • Duration: 38 minutes
  • Severity: SEV-2 (significant but recoverable; no data loss)
  • Customer support tickets: 7 opened, all resolved without escalation

Timeline:

  • 14:32 UTC (T+0) — Migration started via standard deploy pipeline
  • 14:32 UTC (T+0) — pgbouncer alerts on "too many connections"
  • 14:33 UTC (T+1m) — PagerDuty fires, on-call ack
  • 14:38 UTC (T+6m) — On-call identifies DB connection pool exhaustion
  • 14:45 UTC (T+13m) — Migration identified as the cause; held an exclusive transaction
  • 14:51 UTC (T+19m) — Migration killed
  • 14:53 UTC (T+21m) — pgbouncer restarted; connections free up
  • 15:10 UTC (T+38m) — Error rate returns to baseline

Root cause: The migration ran an ALTER TABLE users ADD COLUMN ... NOT NULL DEFAULT '...' on a 50M-row table without a lock timeout. The exclusive lock blocked all writes for 12 minutes; subsequent reads timed out while their connections sat waiting, saturating the pool.

What went well:

  • PagerDuty fired within 60 seconds of the first failed request
  • The on-call team coordinated via Slack incident channel; comms were clear
  • The migration was killable without data corruption (we'd added savepoints)

What went poorly:

  • Our migration template doesn't set lock_timeout or statement_timeout
  • We had no runbook for risky migrations on large tables
  • The first 6 minutes were spent confirming the symptom rather than identifying the cause — our DB-side monitoring should have surfaced the long transaction faster

Where we got lucky:

  • Incident happened at 14:32 UTC (early afternoon EU, morning US East) — traffic was 40% of peak. Same incident during peak would have hit 30,000+ users.
  • The migration was killable. A non-atomic migration in the same scenario could have left the schema in a bad state.

Action items:

  1. Add lock_timeout = '5s' and statement_timeout = '30s' to the migration runner config — owner: @adi, due: 2026-06-01, priority: P0
  2. Write a "risky migrations on >1M-row tables" runbook in docs — owner: @priya, due: 2026-06-15, priority: P1
  3. Set up alerts on long-running transactions (> 30s) — owner: @rohan, due: 2026-06-01, priority: P1
  4. Add monitoring dashboard for migration progress and connection-pool health — owner: @rohan, due: 2026-07-01, priority: P2
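Action item 3 above maps to a one-query check. A sketch assuming Postgres (pg_stat_activity is a built-in view); wiring the result into your alerting pipeline is left out:

import psycopg2

# Transactions open longer than 30 seconds: candidates for an alert.
LONG_TXN_SQL = """
SELECT pid, usename, now() - xact_start AS open_for, left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '30 seconds'
ORDER BY xact_start;
"""

def long_running_transactions(dsn: str) -> list[tuple]:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LONG_TXN_SQL)
        return cur.fetchall()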

How to keep postmortems blameless

Three rules. Write them on a sticky note above your desk.

Rule 1: Replace names with systems. Every time you write "X did Y", reframe to "The Y happened" or "The system allowed Y". The pattern is:

  • ❌ "Adi merged the broken PR" → ✅ "The PR landed without a passing test for this case"
  • ❌ "Priya forgot to update the runbook" → ✅ "The runbook hadn't been updated since the previous incident"
  • ❌ "On-call didn't notice the alert" → ✅ "The alert was buried under N other notifications"

Rule 2: Trust people, fix systems. The on-call engineer who handled the incident did the best they could with the information, tools, and runbooks they had. If the response was slow, the system made it slow — fix the system.

Rule 3: Publish the postmortem to the whole team. Locking postmortems to incident responders teaches no one. Restricting them to managers turns the exercise into performance rather than learning. Publish to engineering at minimum; publish externally for material customer-facing incidents (Cloudflare, Stripe, Vercel all do this, and it builds enormous trust).

How to write postmortems faster

Postmortems are notorious for taking weeks to produce because the responder is exhausted, the details are fading, and writing the doc feels like the lowest-priority thing in the world. Three patterns that work:

1. Capture during the incident, not after. Designate a scribe — a senior engineer or incident commander whose only job is to keep a running timeline in a shared doc. That doc becomes the postmortem skeleton. Details fade fast after an incident; capturing them in real time is far easier than reconstructing them the next day.

2. Use a template. Every section above is in the template. Filling in a template is faster than blank-page writing.

3. Let AI draft from your notes. Our free AI Postmortem Generator takes your raw incident notes (timeline, impact, what we know) and produces a polished blameless postmortem in the Google SRE format, with an action-items table, a root cause section, and the "what went well / poorly / lucky" sections all filled in. Edit from there — it gets you 80% of the way in 60 seconds.

What's the difference between a postmortem, a retrospective, and an incident report?

These three terms get mixed up. Quick disambiguation:

  • Incident report — usually customer-facing, written soon after an incident, focused on impact and status. Light on root cause analysis.
  • Postmortem — internal-focused, detailed, written after the incident is fully resolved, focused on root cause and action items. The Google SRE term.
  • Retrospective — team process review, not tied to a specific incident. Looks at "what's been going well / poorly in our process over the last sprint or quarter".

A material customer-facing incident often produces all three: a public incident report (status page), an internal postmortem (engineering wiki), and input for the next retrospective.

FAQ

Should we name individuals in postmortems at all?

For the response, yes — assigning action item ownership requires names. For the cause, no — describe what the system allowed, not who did it. Owners on action items are different from blame on root causes.

How long should a postmortem be?

Depends on severity. SEV-1 (extended outage, data loss potential): 2-4 pages, fully fleshed-out. SEV-2 (significant but contained): 1-2 pages. SEV-3 (minor degradation): one paragraph in a shared doc may be enough.

Who should write the postmortem?

The incident commander or designated "scribe" — not necessarily the person who fixed the issue. Writing the doc and fixing the issue are different skills, and the person who debugged at 2am is too exhausted to write well at 9am.

When should the postmortem be done by?

72-hour rule: aim to publish within 3 days of the incident closing. Beyond a week, details fade and the team loses momentum on action items.

Do small companies need postmortems?

A 3-person startup doesn't need a 4-page document. But the discipline of asking "what happened, why, what do we change?" applies even to a Slack message. Many founders run an informal version of this for every customer escalation.

What's a "5 whys" analysis?

A root cause analysis technique where you ask "why?" five times in succession, each time digging deeper from the surface symptom to the systemic cause. The fifth "why" usually surfaces the policy, process, or tooling gap that should be the target of an action item.

Should postmortems be public?

For customer-facing incidents at companies with technical users (Cloudflare, GitHub, Stripe, Vercel), yes. Public postmortems build customer trust by demonstrating the team understands the incident and is fixing it. For internal infrastructure issues, no — keep them internal.

How do I track action items from postmortems?

Tag them with a postmortem label in your project tracker. Review the backlog of postmortem action items every 2 weeks. If you're not closing them, you're producing reports instead of preventing incidents.
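A concrete version of that review, assuming GitHub Issues as the tracker (the repo, label, and token handling are placeholders):

import requests

# List open action items carrying the "postmortem" label, for the biweekly review.
def open_postmortem_items(repo: str, token: str) -> list[tuple[str, str]]:
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        params={"labels": "postmortem", "state": "open"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [(issue["title"], issue["html_url"]) for issue in resp.json()]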

TL;DR

  • A blameless postmortem describes what failed in the system, not who did it
  • The Google SRE format has 8 sections: Summary, Impact, Timeline (UTC + T+N), Root Cause, What Went Well, What Went Poorly, Where We Got Lucky, Action Items
  • Replace names with systems in the root cause and what-went-poorly sections
  • Action items must be specific, owned, dated, and priority-tagged
  • Use the free AI Postmortem Generator to convert raw incident notes into a polished SRE-format postmortem in 60 seconds

Pair this practice with the AI Standup Update Generator for daily comms hygiene, and the AI OKR Generator for quarterly planning. The same principles — outcome focus, specificity, named ownership — apply across all three.
