Mitigating Risk

August 7, 2018

You’ve carefully assessed the risks in your new system. You’ve reviewed the implementation, eliminated bad implementation decisions, and removed unnecessary features that carried extra risk. Looking at the result, you see that there is still risk. So, what do you do? You try to mitigate the remaining risk.

Risk mitigation does not remove the risk itself, but attempts to:

  • Reduce the likelihood of the risk occurring
  • Reduce the impact of the risk, if it happens

If you think back to the risk assessment post, these were the two main factors in the quantitative risk assessment. Looking at risk in these terms gives us a framework for thinking about mitigation.
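
As a quick sketch of that framework (the numbers and the function here are illustrative, not from the original posts), the quantitative view treats risk exposure as likelihood times impact, and mitigation attacks one factor or the other:

    # Illustrative sketch: risk exposure as likelihood times impact.
    # The 0.0-1.0 likelihood scale and the cost units are assumptions for this example.

    def risk_exposure(likelihood: float, impact: float) -> float:
        """Expected cost of a risk: chance of occurrence times cost if it occurs."""
        return likelihood * impact

    baseline    = risk_exposure(likelihood=0.30, impact=10_000)  # 3000.0
    fewer_bugs  = risk_exposure(likelihood=0.05, impact=10_000)  # 500.0  (lower likelihood)
    small_blast = risk_exposure(likelihood=0.30, impact=500)     # 150.0  (lower impact)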

Reduce Likelihood of Occurrence

One way to reduce risk is simply to reduce the chance that a particular problem will occur. The two approaches people normally use to reduce the likelihood of a bug occurring are:

  • Try to detect bugs as quickly as possible
  • Try to avoid introducing them in the first place

Since everyone makes mistakes, you need some system for catching them before they go live. There have been a number of approaches for having people check each other’s code. We began with code review. Early approaches were very formal and heavyweight; only very large development groups could sustain them. More lightweight forms of peer review followed. Because programmers quickly noticed that certain mistakes recurred again and again, they also began building code analysis tools that look for dangerous or unwise practices.
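
For example (a hypothetical snippet, not from the original post), a static analysis tool like pylint flags the mutable default argument below as a dangerous practice, exactly the kind of recurring mistake these tools were built to catch:

    # Hypothetical example of a mistake a code analysis tool catches.
    # pylint reports the mutable default argument as "dangerous-default-value",
    # because the same list object is shared across every call.

    def add_item(item, items=[]):      # flagged by the analyzer
        items.append(item)
        return items

    # The safer form a reviewer or tool would suggest:
    def add_item_safely(item, items=None):
        if items is None:
            items = []
        items.append(item)
        return items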

The eXtreme Programming (XP) methodology first suggested turning development best practices up to eleven. In the original books, code review became continuous through Pair Programming, and testing was taken to its extreme with Test-Driven Development (TDD). Later, it became clear that TDD is more of a design approach than a testing approach. All of these practices provide ways to reduce the number of bugs that make it through the development process.
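
As a minimal sketch of the TDD rhythm (the function name and numbers are made up for illustration, and pytest is assumed as the test runner): write a failing test first, then just enough code to make it pass.

    # Illustrative TDD step, assuming pytest; apply_discount is a hypothetical function.
    import pytest

    # Step 1: write the test first. It fails until the code below exists.
    def test_discount_is_applied():
        assert apply_discount(price=100.0, percent=10) == pytest.approx(90.0)

    # Step 2: write just enough code to make the test pass, then refactor.
    def apply_discount(price: float, percent: float) -> float:
        return price * (1 - percent / 100)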

One fairly old practice is manual testing of the code. It is often effective, but hard to reproduce. It also becomes less useful across multiple versions of the software, because people don’t perform exactly the same steps each time. The obvious route to repeatable tests is automated tests, which let us verify that changes to the code do not break existing functionality. Once you have an effective set of automated tests, it becomes tempting to run them all the time, or at least every time code is committed to the repository. That became the practice of Continuous Integration (CI).
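
A minimal sketch of that idea, assuming a Python project with a pytest suite: a small script the CI server runs on every commit, failing the build whenever any test fails.

    # CI gate sketch: run the automated tests on every commit and fail the
    # build if anything breaks. Assumes pytest is installed; a CI server
    # (Jenkins, GitHub Actions, etc.) would invoke this script.
    import subprocess
    import sys

    def main() -> int:
        result = subprocess.run(["pytest", "--quiet"])
        if result.returncode != 0:
            print("Tests failed; rejecting this commit.")
        return result.returncode

    if __name__ == "__main__":
        sys.exit(main())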

Over time, we have found that all of these practices contribute to overall code quality, and none completely overlaps the others.

Reduce Impact

There are several ways to reduce the impact of risky changes. Making smaller changes where possible can, under some circumstances, reduce the impact of each individual change, even if the change as a whole carries the same risk. A staged deploy/release of the code allows us to test the change in a near-production environment, catching problems before they affect production. You can also use a partial release or A/B testing to expose only a limited number of users to the change. With a robust system for rolling back changes, either through feature flags or a blue/green release, we can quickly back out a change as soon as we detect a problem. That approach works particularly well if you have a solid post-deploy test to verify any changes.
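
A sketch of the feature-flag approach (the flag name, rollout store, and checkout functions are all hypothetical): expose the risky path to a small, configurable fraction of users, and roll back instantly by setting that fraction to zero.

    # Illustrative percentage rollout behind a feature flag.
    # The flag store and user IDs are assumptions for this example.
    import hashlib

    ROLLOUT_PERCENT = {"new-checkout-flow": 5}   # 5% of users see the change

    def is_enabled(flag: str, user_id: str) -> bool:
        """Deterministically bucket a user into [0, 100) and compare to the rollout."""
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(flag, 0)

    def old_checkout(user_id: str) -> str:
        return f"old checkout for {user_id}"     # known-good path

    def new_checkout(user_id: str) -> str:
        return f"new checkout for {user_id}"     # risky change, limited exposure

    def checkout(user_id: str) -> str:
        if is_enabled("new-checkout-flow", user_id):
            return new_checkout(user_id)
        return old_checkout(user_id)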

Some kinds of change lend themselves to shadow testing, where you run the new code alongside the old code and compare the results to be sure they remain consistent. At first, you run the new code in parallel, compare its output with the old code’s, and continue to use the old results. As confidence increases, you switch over to the new code.
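
A minimal sketch of that pattern, assuming both implementations are plain functions: run both, log any disagreement, and keep serving the old result until confidence builds.

    # Shadow-testing sketch: call both implementations, compare the results,
    # but keep returning the old one. The function names are illustrative.
    import logging

    logger = logging.getLogger("shadow")

    def shadowed(old_impl, new_impl):
        """Wrap old_impl so new_impl runs alongside it for comparison only."""
        def wrapper(*args, **kwargs):
            old_result = old_impl(*args, **kwargs)
            try:
                new_result = new_impl(*args, **kwargs)
                if new_result != old_result:
                    logger.warning("Mismatch for %r: old=%r new=%r",
                                   args, old_result, new_result)
            except Exception:
                logger.exception("New implementation raised for %r", args)
            return old_result    # the old code still drives production
        return wrapper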

A well-designed logging system lets you monitor the behavior of any change and captures the information needed to recognize and troubleshoot particular problems.
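
As a small illustration (the field names are assumptions, not a prescription), structured log entries that record which release and which flags served each request make it much easier to tie a problem back to a specific change:

    # Sketch of structured logging that ties behavior back to a specific change.
    import json
    import logging

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("requests")

    def log_request(user_id: str, path: str, release: str, flags: dict, status: int):
        logger.info(json.dumps({
            "user": user_id,
            "path": path,
            "release": release,   # which deploy served this request
            "flags": flags,       # which experimental paths were active
            "status": status,
        }))

    log_request("u123", "/checkout", release="2018-08-07.1",
                flags={"new-checkout-flow": True}, status=500)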

Conclusion

None of these approaches really removes any risk. They may mitigate risk by spreading it out, limiting the requests that are impacted, or making it easy to recognize a problem and back out.
