A production incident is when software behaves in a way that frightens or upsets customers and needs to be resolved before you get to eat again. For it to qualify as an emergency, it needs to be causing harm to customers, actively losing money, happening during an important demo, or really bothering that VP. How you handle these incidents can define your company culture.
The first thing you need to do is get everyone together on the phone. Who should attend? As many people as possible, with at minimum one person of each of these core types:
Reporter: A person who reported the problem, likely by having a customer yell at them, who impatiently waits their turn, then yells in a nicer way about how bad the problem is.
Keymaster: A person who has permission to open a door that is needed to see the problem or solve the problem, who is not anywhere near a computer.
Fixer: A person who can fix the problem, who just joined five minutes ago and doesn’t know what is happening.
These are the core players. Spice up the recipe by adding additional flavors:
A person who can fire everyone on the phone, who is required to ask questions once every ten minutes of the form, “Why can’t we just retry it?” or “Why is this taking so long?”, and express frustration as much as possible, but in a professional way.
A person on PTO who is attending the Loudest Sounds Ever Made music festival but is trying to conceal it.
A person who feels that they have caused the problem and is looking to control the narrative so that this isn’t widely known.
A person currently running a daycare for children in their off hours, and the kids are in the middle of a yelling contest.
A person who is not sure why they are there but feels like they need to contribute, so they serve as an autocomplete feature for the less talkative members of the call, chiming in to restate or complete obvious things that have been said. “We should fix this so that we can restore service to our customers”, “If we can see the system logs, we might find out what is happening”, “We should add the right people to the call if needed”, etc.
Once you gather the right people on the phone you can execute the phases of the production incident:
Phase 1
With each new caller that joins, the one-sentence description of the incident is read out loud: “Nothing is working. I’m mad about it.”
Phase 2
Keep paging people until the sentence makes sense to someone. Let’s call this person The Fool.
Once the Fool is identified, let 30-50% of the people on the call leave.
Phase 3
5-15 minutes of silence while the Fool goes through an express version of the five stages of grief, feels ownership of the problem, and starts investigating.
The Fool then tells you who else to page: a combination of the Fool’s team and a series of Keymasters who are needed to open doors.
Phase 4
A solid attempt to solve the problem is made for another 20 minutes until someone mentions that they just need to resolve the problem, not solve it.
A service, server, or expectation is reset which stops the emergency.
Phase 5
Everyone says “thanks” to ensure everyone knows they were there and hangs up.
Phase 6
The next day nobody follows up because ownership has not been assigned. The only visible learning is that the Fool has been identified, so they can be paged more quickly the next time this issue happens. This information is written down in something called a playbook.
These phases are not strictly defined; during any of them, everyone else can discuss the weather, who is at fault, and how to tell the customer, or simply yell at each other.
Paid subscribers can view additional commentary about this post, with links to useful resources on handling production incidents well. If you are reading this and thinking, “well, it depends,” or “this is a gross oversimplification,” or “what an idiot,” then you might be interested in the Additional Commentary for this post.