Commentary: How to run a production incident that creates more incidents
Sometimes people stay on the phone because they are lonely, and that's OK
Emergency system management has many common anti-patterns, and because incident response is so unstructured, they can be hard to see. How to run a production incident that creates more incidents covers some important peopleware factors.
The Anti-Pattern Recipe
There are many ways to mishandle the running of a production incident:
Not a lot of upfront information on what is failing, so the response needs to include intelligence-gathering
Too many people involved
Some of those people are emotional
Power imbalances on the team create fear
No clear ownership of system areas or system functions (leading to annoyed people who didn’t expect to be paged, also leading to just too many people involved)
Permissions are unclear, unknown, or locked down to the point that resolution is much harder
Being emotional (yelling) makes it harder to think and pushes people into defending themselves instead of tracking down the problem and solving it
Trying to solve a problem (fix it for real) gets confused with resolving a problem (fix it for now)
No real follow-up to solve after the resolution, just a penciled-in assignment of ownership, i.e., “When system X breaks, call person Y”
How to truly solve
Of course, doing the opposite makes incidents simpler to run, even if it is harder to put in place:
Gather as much information about the problem as possible, and don’t deny facts
Tolerate only people and conversations that move towards restoration of service
Only call people who are expecting to be called in emergency situations
These people should already have the access they need to help resolve the incident
After resolution, real analysis occurs without the time pressure so that a real solution can happen.
The real solution has a clear person assigned
The idea is that every emergency incident should be a surprise. A famous bug or a recurring emergency should never exist: as soon as something breaks, you should fix it so that it never happens again.
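The difference between resolving (fix it for now) and solving (fix it for real) can be made explicit in whatever record you keep for an incident. Below is a minimal sketch, assuming a hypothetical in-house record; the `Incident` class, its field names, and the example title are all illustrative and not taken from any real tool. The point is only that an incident is not closed until the real fix is done and has one named owner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; all field names are illustrative."""
    title: str
    resolved: bool = False                # service restored (fixed for now)
    solved: bool = False                  # root cause removed (fixed for real)
    solution_owner: Optional[str] = None  # the one person assigned the real fix

    def needs_follow_up(self) -> bool:
        # Resolved but not solved: the same problem can come back.
        return self.resolved and not self.solved

    def is_closed(self) -> bool:
        # Closed only when service is restored and the real fix is done and owned.
        return self.resolved and self.solved and self.solution_owner is not None


incident = Incident(title="Checkout API timeouts", resolved=True)
print(incident.needs_follow_up())  # True
print(incident.is_closed())        # False
```

Keeping `resolved` and `solved` as separate flags makes it harder to quietly treat a workaround as the end of the story.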
Organizationally, this is more difficult than you might imagine. An organization can split into two areas - let’s call them Service and Product. They could be called a million things, but the rough responsibilities are:
Service: the people in charge of keeping things running.
Product: the people in charge of building the things that should run.
In your organization, this split might be called Ops and Development, DevOps and AppDev, CloudOps and Those Idiots, or Odds and Ends. Or, in your more enlightened (or smaller) organization, this split might not exist, or it might be a small split with high collaboration. If the split exists, the groups are rewarded differently for production incidents:
Service: get it running again as fast as possible.
Product: have it never happen again.
Service is the hero if they can keep a 911 call from ever happening, so they create scripts to restart services every three days and create playbooks that reboot things. They can tolerate famous bugs for months as long as they know how to work around them. They create things like auto-defibrillators to prevent patients from dying during a heart attack.
Product is the hero, well, not that often. If they really solve the problem they are the hero, but the absence of work is never rewarded like the existence of work. They create things like a more reasonable diet or an exercise program to prevent heart attacks.
If these two groups are not working very tightly together, Service can accidentally hide problems from Product or make them appear less severe, so that they are never worked on. And you can have a lot of production incidents that are “auto-resolved” without ever really being fixed.
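One way to keep workarounds from hiding problems is to count how often each one runs and hand that count to Product. The sketch below assumes a hypothetical log of workaround runs; the `workaround_log` data, the system names, and the `recurring_workarounds` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical log of workaround runs: (system, workaround) pairs.
workaround_log = [
    ("billing", "restart"),
    ("billing", "restart"),
    ("search", "reboot"),
    ("billing", "restart"),
]


def recurring_workarounds(log, threshold=2):
    """Return systems whose workaround ran at least `threshold` times."""
    counts = Counter(system for system, _ in log)
    return {system: n for system, n in counts.items() if n >= threshold}


print(recurring_workarounds(workaround_log))  # {'billing': 3}
```

Anything that shows up in this report is an "auto-resolved" incident that deserves a real fix, not just another scheduled restart.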
In a small organization, this problem is trivial: put developers on-call. In larger organizations, you need someone very experienced with all the systems to narrow down a problem, and you can risk burning this group out by having them always on call. An on-call rotation has to occur, along with cross-training and better documentation.
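A rotation can be as simple as a round-robin list. This is a minimal sketch, assuming one engineer is on call per week; the `build_rotation` function and the names are hypothetical, and a real schedule would also need to handle vacations, time zones, and swaps.

```python
from itertools import cycle, islice


def build_rotation(engineers, weeks):
    """Assign one engineer per week, round-robin, so nobody is always on call."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return list(islice(cycle(engineers), weeks))


print(build_rotation(["ana", "ben", "chi"], 5))
# ['ana', 'ben', 'chi', 'ana', 'ben']
```

The longer the list of cross-trained engineers, the less often each one carries the pager.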
Caveats and It-Depends and Common Objections
Common objections to the original post:
In an emergency, you can tolerate more people than needed because it is an emergency
You can’t have clear lines of responsibility when your organization is small
Sometimes it is better to stay on the phone and Fix it for Real and not just Fix it for Now
For more discussion on doing a real post-mortem, see Additional Commentary: How to write a post-mortem that always blames Terry