No matter how careful your testing, no matter how complete your review, there is a non-zero probability that releasing your new code will expose a problem. The last phase of risk management is handling a risk that actually manifests.
At the end of the last post, I mentioned the importance of post-release testing and the ability to roll back changes as ways to mitigate risk. That approach covers a particular set of circumstances:
- the problem is easy to detect
- the change can be turned off or backed out quickly
- the problem appears a short time after the deploy
- the problem is not catastrophic
If the event is covered by this set of circumstances, and you catch the problem in time, and you back out the problem code, the impact is pretty small.
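The "detect quickly, back out quickly" path lends itself to automation. Below is a minimal sketch of a post-deploy watch loop, assuming a hypothetical `/health` endpoint; the URL, check count, and what "rollback" actually means all depend on your own deploy tooling.

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical endpoint


def healthy(url: str = HEALTH_URL) -> bool:
    """Return True if the service answers its health check with a 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def post_deploy_watch(check, checks: int = 5, interval: float = 60.0) -> bool:
    """Poll a health check for a while after a deploy.

    Returns True if the release looks good, False if the caller
    should invoke its rollback procedure. The check is passed in
    so it can be anything: an HTTP probe, a log scan, a metric query.
    """
    for _ in range(checks):
        if not check():
            return False
        time.sleep(interval)
    return True
```

A deploy script might call `post_deploy_watch(healthy)` and trigger its rollback command when it returns `False`.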
What happens if these conditions aren’t met? Sooner or later every system has an incident. Either we did not protect against the risks we knew of well enough, a risk we hadn’t considered bit us, or a risk we had thought of turned out to have more impact. However it happened, the system is down or degraded to the point that customers/users notice. Now what do you do?
Part of risk mitigation is intelligently dealing with an incident. To do that, you need a plan. Dealing with an incident happens in three parts:
- restore functionality
- analyze the incident
- prepare for the next incident
While dealing with the incident, we need to troubleshoot the symptoms and restore functionality quickly. At the same time, we need to preserve as much information as possible for analyzing the incident later (this might mean saving logs from systems before we rebuild them, or taking snapshots of critical information before restarting the system). It can be useful to have someone who is not actively working on the recovery take notes, so that we have a record of what was done to recover. We also need a single channel for communicating among everyone working on the recovery. This could be everyone in one room, a chat session that everyone uses, a conference call, or a video conference that everyone joins. The mechanism matters less than the fact that there is one place to update the team.
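The note-taking does not need to be fancy; a timestamped, append-only log is enough. Here is one possible shape for such a helper (the file path and entry format are just illustrative choices):

```python
from datetime import datetime, timezone


def log_incident_note(path: str, author: str, note: str) -> None:
    """Append a timestamped entry to the incident log.

    A single append-only file gives the analysis phase a reliable
    record of who did what, and when; entries are never edited
    or reordered after the fact.
    """
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{stamp}  {author}: {note}\n")
```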
In the best case, the system is restored by rolling back the last change or restarting a server. In these cases, only one or two people are needed to do the actual work. All information about the problem and the steps we performed to recover should still be carefully documented. It’s also a good idea to make sure that at least one other person is watching as changes are made. Having two people agree to a change makes mistakes in recovery slightly less likely.
Sometimes, no one has any idea why the problem has occurred or which symptoms actually matter. In this case, we may have multiple people investigating different parts of the system or different symptoms in an attempt to find the problem. It’s still a good idea to work in pairs and to keep the whole group apprised of any changes before you make them. In these cases, the notes will be even more important. During the various tests or experiments, we may not be positive which change solved the problem.
Shortly after the functionality is restored, we need to make certain:
- all of the relevant symptoms are documented
- all steps we took to resolve the issue are documented
- all of the artifacts (that we did save) are archived someplace safe
These will be needed for the next stage.
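Archiving the saved artifacts can also be scripted so nothing gets lost in the post-incident shuffle. This is one way it might look, assuming the team names archives by incident date; skipping missing paths (rather than aborting) is a deliberate choice, since some artifacts may not have survived the recovery.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_artifacts(paths, dest_dir="incident-archives", incident_id=None):
    """Bundle saved logs and snapshots into one compressed archive.

    incident_id defaults to today's date. Paths that no longer
    exist are skipped rather than failing the whole archive.
    Returns the path of the archive that was written.
    """
    incident_id = incident_id or date.today().isoformat()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"incident-{incident_id}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for p in map(Path, paths):
            if p.exists():
                tar.add(p, arcname=p.name)
    return archive
```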
The next stage is as important as restoring functionality, even though people often skip it. Within a few days of the incident, someone (preferably several people) needs to take all of the information gathered during the incident and attempt to determine the cause or causes. This analysis should focus on what actually went wrong, how we could have prevented it, and how we will prevent it in the future. The goal of this process is to identify things we can do as part of development and deployment to reduce risk going forward.
Depending on your organization, you may do a formal FMEA (Failure Mode and Effects Analysis) or RCA (Root Cause Analysis). Or you might just have your team examine the evidence collected during the incident and brainstorm ways to prevent it from happening again.
The analysis from the last section should result in some actions we can take. Some will be procedures we can put in place for better testing or code review. These may take the form of checklists or static analysis tools. We could schedule special training to spot certain kinds of errors. We might improve our testing to reduce the chances of something like this slipping through again.
Another potential area for change would be better monitoring of the production system with alerting to recognize the problems sooner. We might increase logging to allow us to troubleshoot more quickly.
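As a concrete example of "recognize the problem sooner", an alert on the error rate in application logs is often one of the first monitors teams add after an incident. This is only a sketch; the log format, the `" ERROR "` marker, and the 5% threshold are assumptions to be replaced with your own.

```python
def error_rate(lines) -> float:
    """Fraction of log lines that contain an error marker."""
    lines = list(lines)
    if not lines:
        return 0.0
    errors = sum(1 for line in lines if " ERROR " in line)
    return errors / len(lines)


def should_alert(lines, threshold: float = 0.05) -> bool:
    """True when the error rate meets or exceeds the alert threshold."""
    return error_rate(lines) >= threshold
```

A cron job or log-shipping pipeline could feed recent lines through `should_alert` and page someone when it fires.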
The final area for actions would be the development of procedures and scripts that can be used to recover from a problem like this one as rapidly and safely as possible.
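A recovery runbook can start life as prose and graduate into a script that executes the steps in order and records what happened. Here is one minimal shape for that idea; the step descriptions and commands are placeholders, and stopping at the first failure (so a human can intervene) is a design choice, not a requirement.

```python
import subprocess

# Hypothetical runbook: each step pairs a description with a command.
RUNBOOK = [
    ("drain traffic from the bad node", ["true"]),   # placeholder command
    ("restart the application service", ["true"]),   # placeholder command
    ("verify the health endpoint", ["true"]),        # placeholder command
]


def run_runbook(steps):
    """Execute recovery steps in order, stopping at the first failure.

    Returns a list of (description, succeeded) pairs so the
    note-taker has a record of exactly what was run.
    """
    results = []
    for desc, cmd in steps:
        ok = subprocess.run(cmd).returncode == 0
        results.append((desc, ok))
        if not ok:
            break
    return results
```

Because each step is data rather than ad-hoc typing at a shell, the runbook can be reviewed, versioned, and rehearsed before the next incident.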
As long as there is change in a system or its environment, there will be risks. Some of those risks will result in an incident. We need to strive to learn from each incident to prepare for similar risks in the future.