A lot of teams are fascinated with the notion of ‘rollback’ in their environments. They seem to be addicted to its seductive conceptual simplicity. It does, of course, have valid uses, but, like a lot of things, it can become a self-destructive dependency if it is abused. So, let’s take a look at some of the addictive properties of rollback and how to tell if you might have a problem.
Addictive property #1 – everybody is doing it. One of the first things we learn in technology (Dev or Ops) is to keep a backup of a file when we change something. That way, we know what we did in case it causes a problem. In other words, our first one is free and was given to us by our earliest technology mentors. As a result, everyone knows what it is. It is familiar and socially acceptable. We learn it while we are so professionally young that it becomes a “comfort food”. The problem is that it is so pervasive that people do it without noticing. They will revert things without giving it a second thought because it is a “good thing”. And this behavior is a common reason rollback does not scale. In a large system, where many people might be making changes, others might make changes based on your changes. So, undoing yours without understanding others’ dependencies means that you may be breaking other things in an attempt to fix one thing. If you are in a large environment and changes “backward” are not handled with the exact same rigor as changes forward, you might have a problem.
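To make the dependency point concrete, here is a minimal sketch of the check a team could run before reverting anything. It assumes a change log that records which earlier changes each change builds on. The `dependents_of` helper, the `depends_on` map and the change names are all hypothetical, made up for illustration rather than taken from any real tool.

```python
from collections import deque


def dependents_of(change_id, depends_on):
    """Return every change that directly or transitively builds on change_id."""
    # Invert "X depends on Y" into "Y is depended on by X".
    reverse = {}
    for change, parents in depends_on.items():
        for parent in parents:
            reverse.setdefault(parent, set()).add(change)

    # Breadth-first walk over everything downstream of change_id.
    found = set()
    queue = deque([change_id])
    while queue:
        current = queue.popleft()
        for child in reverse.get(current, ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Hypothetical change log: each change lists the changes it was built on.
depends_on = {
    "add-cache-config": [],
    "tune-cache-size": ["add-cache-config"],
    "enable-cache-metrics": ["tune-cache-size"],
    "rotate-logs": [],
}

# Anything in this set would be broken by blindly reverting "add-cache-config".
blocked_by = dependents_of("add-cache-config", depends_on)
safe_to_revert = not blocked_by
```

The hard part in real life is not the graph walk but getting an honest `depends_on` map in the first place. If nobody records what their change was built on, no script can tell you what a rollback will break.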
Addictive property #2 – it makes you feel good and safe. The idea that “you are only as good as your last backup” is pretty pervasive. So, the ability to roll something back to a ‘known good’ state gives you that warm, fuzzy feeling that it’s all OK. Unfortunately, in large-scale situations with any significant architectural complexity, it is probably not OK. Some dependency is almost certainly unknown, overlooked or assumed to be handled manually. That will lead to all sorts of “not OK” when you try to roll back. If rollback is the default contingency plan for every change you make and you don’t systematically look at other options to make sure it is the right answer, you might have a problem.
Addictive property #3 – it is easy to sell. Management does not understand the complexity of what is required to implement a set of changes, but they do understand “Undo”. As a result, it is trivial to convince them that everything is handled and, if there should be a problem with a change, you can just ‘back it out’. Being able to simplify the risks to an ‘undo’ type of concept can eliminate an important checkpoint from the process. Management falls into the all too human behavior of assuming there is an ‘undo’ for everything and stops questioning the risk management plan because they think it is structurally covered. This leads to all sorts of ugliness should there be a problem and the expectation of an easy back-out is not met. Does your team deliberately check its contingency plan for oversights every time, or does it assume that it will just ‘roll it back’? If the latter, you might have a problem.
As usual, the fix for a lot of this is self-discipline. The discipline to do things the hard and thorough way that takes just that little bit longer. The discipline to institutionalize and reward healthy behaviors in your shop. And, as usual, that goes against a fair bit of human nature that can be very difficult to overcome, indeed.