Commentary: How to write a post-mortem that always blames Terry
or Steve or Sally or Rebecca or that stupid mascot dog you have on our website
First of all, I apologize if your name is Terry. How to write a post-mortem that always blames Terry was not intended to hurt any Terrys, but we didn’t think of the dark underbelly of our work, and for that, we apologize.
Now that this is out of the way, this post covers a list of anti-patterns related to poor incident management, in which no post-mortem is done or it is done badly, and other harmful human factors.
The Anti-Pattern Recipe
This one has some spicy peopleware flavors:
A post-mortem is done, but not done well:
The 5 Whys are used poorly: at the wrong depth and by the wrong people.
Individuals are blamed, the opposite of a “blameless post-mortem”.
The processes and organizational factors are not considered because the people in power run those, not our poor Terry.
A single person (root cause) is blamed for the incident.
Bad Five Whys
Google is seen as the shining example of a company that does site reliability better than anyone, and their simple guides are gold mines. You basically shouldn’t be doing post-mortems without having read that link.
Power Matters
Organizational structures resist actions that try to break them down. A post-mortem, if done poorly, becomes a blame game, and those with power are best placed to dodge the blame. They run these meetings, for example, and decide who attends, what is written down, and what isn’t. They can even decide whether there is a post-mortem at all.
Ideally, a separate group handles incident coordination, tracking, post-mortem creation, and the other necessary follow-up work. This group should act almost like a helpful watchdog, calmly steering towards steady improvement and not allowing gamesmanship.
Single Root Cause
Incidents do not have a single root cause; based on this logic, they cannot have a single person to blame.
This is not typically how engineers are trained, but it speaks to behavior I’ve seen before in less extreme cases. When the software industry started to borrow post-mortem ideas from more mature engineering fields, the idea of a 5-whys exercise driving to a Single Root Cause came into fashion, and it hasn’t entirely gone away. In toy examples, like those from the original Toyota concept, a single root cause can be found.
Think of some non-software examples, and you’ll see it: why did you gain 10 lbs? Well, the single root cause is likely you eating more, but it could also be you moving less or having more stress. You might step back and realize it is because of a lifestyle change (you having your first child, you moving to an apartment above a Chipotle, you taking a new medicine, you starting a new stressful project, etc.). In complex systems, there are rarely simple, direct answers.
In The Art of Thinking Clearly, chapter 97 is titled “The Stone Age Hunt for Scapegoats Fallacy of the Single Cause” and includes this passage from Tolstoy:
When an apple ripens and falls - what makes it fall? Is it that it is attracted to the ground, is it that the stem withers, is it that the sun has dried it up, that it has grown heavier, that the wind shakes it, that the boy[1] standing underneath it wants to eat it? No one thing is the cause.
More Complex Root Cause Analysis
A good book on the human factors that prevent mature responses to an incident is Black Box Thinking, which is a great layman’s introduction to how airlines run crash investigations and the cultural obstacles they used to face (and that are currently faced in other industries). I mention this book in Commentary: How to treat programming with illogical seriousness, as it draws a strong distinction between how airlines treat failures (a chance to learn) and how healthcare treats them (a chance to deny, pick a weak single cause, then run away to play golf).
A more mature approach is emerging[2]: Learning from Incidents, which raises the bar to improve learning and move away from common wrong answers like human error and single root cause.
What Terry should do
Also, Terry should find a new job. And maybe call his mother.
[1] Terry. The boy’s name is Terry.
[2] Emerging in our field, that is; it has existed under other names in process management and physical safety for decades.