Predictability is Predictably Hard

In order to successfully automate something, the pieces being automated have to be ‘predictable’. I use ‘predictable’ here – rather than ‘consistent’ – deliberately. A ‘predictable’ environment means you can anticipate its state and configuration. ‘Consistent’ gets misconstrued as ‘unchanging’, which is the opposite of what Agile software delivery is trying to achieve.

Consider deploying a fresh build of an application into a test environment. If you cannot predict what the build being deployed looks like and how the environment will be set up, why would you expect to reliably be able to get that build working in that environment in a predictable window of time? And yet, that is exactly what so many teams do.

The proposed solution is usually to automate the deployment. That, however, leads to its own problems if you do not address the predictability of the underlying stuff being automated. I talk to teams with stories about how they abandoned automation because it ‘slowed things down’ or ‘just did not work’. That leads teams to say, and in some cases believe, that their applications are ‘too complex to deploy automatically’.

At the heart of achieving predictability of the code packages and environments is the fact that they are different teams. Somehow it is harder to collaborate with the developers or operations team than it is to spend months attempting to build a mountain of hard to maintain deployment code. A mountain of code that stands a good chance of being abandoned, by the way. That represents months of wasted time, effort, and life because people working on the same application do not collaborate or cooperate.

And we get another example of why so many DevOps conversations become about culture rather than technology… Which really sucks, because that example is at the expense of a fair bit of pain from the real people on those teams.

The lesson here is that there is no skipping the hard work of establishing predictability in the packaging of the code and environments before charging into automating deployments. We are in an era now where really good packaging and configuration management tools are very mature.
And the next generation of tools that unifies the code and environment changes into immutable, deployable, and promotable artifacts is coming fast. But even with the all of these awesome tools, cross-disciplinary experts will have to come together to contribute to the creation of predictable versions of those artifacts.

The ‘C’ in CAMS stands for “Collaboration”. There are no shortcuts.

This article is also on LinkedIn here: https://www.linkedin.com/pulse/predictability-predictably-hard-dan-zentgraf/

Your Deployment Doc Might Not be Useful for DevOps

One of the most common mistakes I see people making with automation is the assumption that they can simply wrap scripts around what they are doing today and be ‘automated’. The assumption is based around some phenomenally detailed runbook or ‘deployment document’ that has every command that must be executed. In ‘perfect’ sequence. And usually in a nice bold font. It was what they used for their last quarterly release – you know, the one two months ago? It is also serving as the template for their next quarterly release…

It’s not that these documents are bad or not useful. They are actually great guideposts and starting points for deriving a good automated solution to releasing software in their environment. However, you have to remember that these are the same documents that are used to guide late night, all hands, ‘war room’ deployments. The idea that their documented procedures are repeatablly automate-able is suspect, at best, based on that observation alone.

Deployment documents break down as an automate-able template for a number of reasons. First, there are almost always some number of undocumented assumptions about the state of the environment before a release starts. Second, using the last one does not account for procedural, parameter, or other changes between the prior and the upcomming releases. Third, the processes usually unconsciously rely on interpretation or tribal knowledge on the part of the person executing the steps. Finally, there is the problem that steps that make sense in a sequential, manual process will not take advantage of the intrinsic benefits of automation, such as parallel execution, elimination of data entry tasks, and so on.

The solution is to never set the expectation – particularly to those with organizational power – that the document is only a starting point. Build the automation iteratively and schedule multiple iterations at the start of the effort. This can be a great way to introduce Agile practices into the traditionally waterfall approaches used in operations-centric environments. This approach allows for the effort that will be required to fill in gaps in the document’s approach, negotiate standard packaging and tracking of deploy-able artifacts, add environment ‘config drift’ checks, or any of the other common ‘pitfall’ items that require more structure in an automated context.

This article is also on LinkedIn here: https://www.linkedin.com/pulse/your-deployment-doc-might-useful-devops-dan-zentgraf

A System for Changing Systems – Part 4 – Groundwork for Understanding the Capabilities of a System Changing System

In the last couple of posts, we have talked about how application systems need a change application system around them to manage the changes to the application system itself. A “system to manage the system” as it were. We also talked about the multi-part nature of application systems and the fact that the application systems typically run in more than one environment at any given time and will “move” from environment to environment as part of their QA process. These first three posts seek to set a working definition of the thing being changed so that we can proceed to a working definition of a system for managing those changes. This post starts that second part of the series – defining the capabilities of a change application system. This definition will then serve as the base for the third part – pragmatically adopting and applying the capabilities to begin achieving a DevOps mode of operation.

DevOps is a large problem domain with many moving parts. Just within the first set of these posts, we have seen how four rather broad area definitions can multiply substantially in a typical environment. Further, there are aspects of the problem domain that will be prioritized by different stakeholders based on their discipline’s perspective on the problem. The whole point of DevOps, of course, is to eliminate that perspective bias. So, it becomes very important to have some method for unifying the understanding and discussion of the organizations’ capabilities. In the final analysis, it is not as important what that unified picture looks like as it is that the picture be clearly understood by all.

To that end, I have put together a framework that I use with my customers to help in the process of understanding their current state and prioritizing their improvement efforts. I initially presented this framework at the Innovate 2012 conference and subsequently published an introductory whitepaper on the IBM developerWorks website. My intent with these posts is to expand the discussion and, hopefully, help folks get better faster. The interesting thing to me is to see folks adopt this either as is or as the seed of something of their own. Either way, it has been gratifying to see folks respond to it in its nascent form and I think the only way for it to get better is to get more eyeballs on it.

So, here is my picture of the top-level of the capability areas (tools and processes) an organization needs to have to deliver changes to an application system.

Capabilities

Overview of capability areas required to sustain environments

The quality and maturity of these within the organization will vary based on their business needs – particularly around formality – and the frequency with which they need to apply changes.

I applied three principles when I put this together:

  • The capabilities had to be things that exist in all environments that application system runs (ie dev, test, prod, or whatever layers exist). THe idea here is that such a perspective will help unify tooling and approaches to a theoretical ideal of one solution for all environments.
  • The capabilities had to be broad enough to allow for different levels of priority / formality depending on the environment. The idea is to not burden a more volatile test environment with production-grade formality or vice-versa. But to allow a structured discussion of how the team will deliver that capability in a unified way to the various environments. DevOps is an Agile concept, so the notion of minimally necessary applies.
  • The capabilities had to be generic enough to apply to any technology stack that an organization might have. Larger organizations may need multiple solutions based on the fact that they have many application systems that were created at different points in time, in different languages, and in different architectures. It may not be possible to use exactly the same tool / process in all of those environments, but it most certainly is possible to maintain a common understanding and vocabulary about it.

In the next couple of posts, I will drill a bit deeper into the capability areas to apply some scope, focus, and meaning.

A System for Changing Systems – Part 3 – How Many “Chang-ee”s

As mentioned in the last post, once there is a “whole system” understanding of an application system, the next problem is that there are really multiple variants of that system running within the organization at any given time. There are notionally at least three: Development, Test, and Production. In reality, however, most shops frequently have multiple levels of test and potentially more than one Development variant. Some even have Staging or “Pre-production” areas very late in test where the modified system must run for some period before finally replacing the production environment. A lot of this environment proliferation is based on historic processes that are themselves a product of the available tooling and lessons organizations have learned over years of delivering software.

Example Environment Flow

This is a simplified, real-world example flow through some typical environments. Note the potential variable paths – another reason to know what configuration is being tested.

Tooling and processes are constantly evolving. The DevOps movement is really a reflection of the mainstreaming of Agile approaches and cloud-related technologies and is ultimately a discussion of how to best exploit it. That discussion, as it applies to environment proliferation, means we need to get to an understanding of the core problems we are trying to solve. The two main problem areas are maintaining the validity of the sub-production environments as representative of production and tracking the groupings of changes to the system in each of the environments.

The first problem area, that of maintaining the validity of sub-production envrionments, is a more complex problem than it would seem. There are organizational silo problems where multiple different groups own the different environments. For example, a QA group may own the lab configuraitons and therefore have a disconnect relative to the production team. There are also multipliers associated with technical specialities, such as DBAs or Network Administration, which may be shared across some levels of environment. And if the complexity of the organization was not enough, there are other issues associated with teams that do not get along well, the business’ perception that test environments are less critical than production, and other organizational dynamics that make it that much more difficult to ensure good testing regimes are part of the process.

The second key problem area that must be addresssed is tracking the groups of changes to the application system that are being evaluated in a particular sub-production environment. This means having a unique identifier for the combination of application code, the database schema and dataset, system configuration, and network configuration. That translates to five version markers – one for each of the main areas of the application system plus one for the particular combination of all four. On the surface, this is straightforward, but in most shops, there are few facilities for tracking versions of configurations outside of software code. Even when they are, they are too often not connected to one another for tracking groupings of configurations.

They typical pattern for solving these two problems actually begins with the second problem first. It is difficult to ensure the validity of a test environment if there is no easy way to identify and understand the configuration of the components involved. This is why many DevOps initiatives start with configuration management tools such as Puppet, Chef, or VMWare VCenter. It is also why “all-in-one” solutions such as IBM’s Pure family are starting to enter the market. Once an organization can get a handle on their configurations, then it is substantially easier to have fact-based engineering conversations about valid test configurations and environments because everyone involved has a clear reference for understanding exactly what is being discussed.

This problem discussion glosses over the important aspect of being able to maintain these tools and environments over time. Consistently applying the groups of changes to the various environments requires a complex system by itself. The term system is most appropirate because the needed capabilities go well beyond the scope of a single tool and then those capabilities need to be available for each of the system components. Any discussion of such broad capabilities is well beyond the scope of a single blog post, so the next several posts in this series will look at framework for understanding the capabilities needed for such a system.

A System for Changing Systems – Part 1 – Approach

This is the first post in a series which will look at common patterns among DevOps environments.  Based on these patterns, they will attempt to put a reasonable structure together that will help organizations focus DevOps discussions, prioritize choices, and generally improve how they operate.

In the last post, I discussed how many shops take the perspective of developing a system for DevOps within their environments.  This notion of a “system for changing systems” as a practical way of approaching DevOps requires two pieces.  The first is the system being changed – the “change-ee” system.  The second is the system doing the changing – the “DevOps”, or “change-er” system.  Before talking about automatically changing something, it is necessary to have a consistent understanding of the thing being changed.  Put another way, no automation can operate well without a deep understanding of the thing being automated.  So this first post is about establishing a common language for generically understanding the application systems; the “change-ee” systems in the discussion.

A note on products, technologies and tools…  Given the variances in architectures for application (“change-ee”) systems, and therefore the implied variances on the systems that apply changes to them, it is not useful to get product prescriptive for either.  In fact, a key goal with this framework is to ensure that it is as broadly applicable and useful as possible when solving DevOps-related problems in any environment.  That would be very difficult if it overly focused on any one technology stack.  So, these posts will not necessarily name names other than to use them as examples of categories of tools and technologies.

With these things in mind, these posts will progress from the inside-out.  The next post will begin the process with a look at the typical components in an application system (“change-ee”).  From there, the next set of posts will discuss the capabilities needed to systematically apply changes to these systems.  Finally, after the structure is completed, the last set of posts will look at the typical progression of how organizations build these capabilities.

The next post will dive in and start looking at the structure of the “change-ee” environment.

DevOps is about Developing a Change Application System

As the DevOps movement rolls on, there is a pattern emerging. Some efforts are initiated by development, seeking relief on test environment management. Others are initiated by operations departments trying to get more automation and instrumentation into the environments they manage. I frequently hear comments that are variations on “same stuff I’ve been doing for xx years, different title” from people who have DevOps in their job title or job description. Their shops are hoping that if they encourage folks to think about DevOps and maybe try some new tools, they will get the improvements promised by DevOps discussions online. Well, just like buying a Ferrari and talking it up won’t make you Michael Schumacher, having Puppet or Chef to do your configuration management won’t “make you DevOps” (whatever that means). Successful DevOps shops are bypassing the window dressing and going at DevOps as a project unto itself.

There are a number of unique aspects to undertaking a project such as this. They require a holistic perspective on the process, touch a very broad range of activities, and provide an approach for changing other systems while being constantly changed themselves.

These projects are unique in the software organization because they require a team to look at the whole end-to-end approach to delivering changes to the application systems within that organization FROM THE SIDE; rather than from a position somewhere in the middle of the process. This is an important difference in approach, because it forces a change in perspective on the problem. Typically, someone looking from either the development or the operations “end” of the process will often suffer from a perceptive problem where the “closer” problems in the process look bigger than the ones “farther” up or down the process line. It is a very human thing to be deceived by the perspective of our current position. After all, there are countless examples of using perspective for optical illusions. Clever Leaning Tower of Pisa pictures (where someone appears to be holding it up) and the entire Lord of the Rings movie trilogy (the actors playing the hobbits are not that short) provide easy examples. Narrowness of perspective is, in fact, a frequent reason that “grassroots” efforts’ fail outside of small teams. Successfully making large and impactful changes requires a broader perspective.

The other breadth-related aspect of these programs is that they touch a very wide range of activities over time and seek to optimize for flow both through and among each. That means that they have some similarities with supply chain optimization and ERP projects if not in scale, then in complexity. And the skills to look at those flows probably do not exist directly within the software organization, but in the business units themselves. It can be difficult for technology teams, that see themselves as critical suppliers of technology to business units, to accept that there are large lessons to be learned about technology development from the business units. It takes a desire to learn and change at a level well above a typical project.

A final unique part is that there must be ongoing programs for building and enhancing a system for managing consistent and ongoing changes in other systems. Depending on your technology preference, there are plenty of analogies from pipelines, powergrids and aircraft that apply here. Famous and fun ones are the flight control systems of intrinsically unstable aircraft such as the F-16 fighter or B-2 bomber. These planes use technology to adjust control surfaces within fractions of a second to maintain steady and controled flight within the extreme conditions faced by combat aircraft. Compared to that, delivering enhancements to a release automation system every few weeks sounds trivial, but maintaining the discipline and control to do so in a large organization can be a daunting task.

So the message here is to deliberately establish a program to manage how changes are applied. Accept that it is going to be a new and unusual thing in your organization and that it is going to require steady support and effort to be successful. Without that acceptance, it will likely not work out.

My next few posts are going to dig into this deeper and begin looking at the common aspects of these programs, how people approach the problem, how they organize and prioritize their efforts, and the types of tools they are using.

Another Example of Grinding Mental Gears

I recently got a question from a customer who was struggling with the ‘availability’ of their sub-production environments. The situation brought into focus a fundamental disconnect between the Ops folks who were trying to maintain a solid set of QA environments for the Dev team and what the Dev teams needed. To a large extend this is a classic DevOps dilemma, but the question provides an excellent teaching moment. Classic application or system availability as defined for a production situation does not really apply to Dev or multi-level Test environments.

Look at it this way. End user productivity associated with a production environment is based upon the “availability” of the application. Development and Test productivity is based upon the ability to view chagnes to the application in a representative (pre-production) environment. In other words the availability of the _changer_ in pre-production is more valuable to Dev productivity than any specific pre-production instance of the application environment. Those application environment instances are, in fact, disposable by definition.

Disposability of a running application environment is a bit jarring to Ops folks when they see a group of users (developers and testers in this case) needing the system. Everything in Ops tools and doctrine is oriented toward making sure that an application environment gets set up and STAYS that way. That focus on keeping things static is exactly the point to which DevOps is a reaction.  Knowing that does not make it easy to make the mental shift, of course.  Once made, however, it is precisely why tools that facilitate rapidly provisioning environments are frequently the earliest arrivals when most organizations seek to adopt DevOps.

About Those “QA” Environments…

DevOps is about getting developed software to users faster and getting feedback from that software back to developers faster. The notion of a clean cycle with very low latency is a compelling vision. Most IT shops are struggling with how to get there and many must maintain a division between the full production environment and the test environments that lead to production. Some of the reasons are historical, some are managerial, and some are truly related to the business environment in which that organization operates.

Fortunately or unfortunately, most shops have plenty to do before they need to worry about tying production in more directly. The state of QA environments is usually relatively weakly managed. Part of that is historic – the environments are not viewed as very important and it is never quite clear who owns keeping them properly current to the production environment. Part of this is the traditional focus of narrowly testing features of the application to minimize test cycles. And part, too, is a historical view that lab configurations did not matter as long as they were ‘close enough’ for testing features.

Of course recent lessons are teaching us that the historical approach is not necessarily conducive to rapid iteration of software. We are also learning that theoretically small changes to the production environments can potentially invalidate theoretically tested deliverables. Combined with the advent of very good rapid provisioning systems, automated configuration management tools, and highly virtualized infrastructures there are few reasons to not have a first-rate QA environment.

But is the paradigm of QA “environments” really the right paradigm for how we approach rapidly releasing features into the wild? As teams try to lessen the notion of a big, standing lab environment for testing software, the approach looks somewhat less like traditional testing and more like qualifying a new feature for use in the system. This is a subtle difference. “Quality Assured” and “Qualified for Use” are two different notions. One says you delivered what you set out to deliver. The other says you know it works in some situation. Some would say that “Quality” implies the latter, but I would answer that if you have to parse a definition to get a meaning, you probably are using the wrong word.

But words only matter to a point, it is the pardigm they represent that is ultimately interesting and impactful. There are extreme examples in the “real” world. For example, just because a part is delivered with quality to its design goals does NOT mean that it is certified for use in a plane. As someone who flies often, I view this as good.

So the question I would ask is whether you simply test to see if it meets design goals in some hopefully representative lab somewhere or do you use DevOps techniques to truly qualify releases for use in the real production environment.

How Fast Should You Change the Tires?

I am an unabashed car nut and like to watch a variety of motor racing series. In particular I tend to stay focused on Formula 1 with a secondary interest in the endurance series (e.g Le Mans). In watching several races recently, I observed that the differences in how each series managed tire changes during pit stops carried some interesting analogies to deploying software quickly.

Each racing series has a different set of rules and limitations with regard to how pit stops may be conducted. These rules are imposed for a combination of safety reasons, competitive factors, and the overall viability of the racing series. There are even rules about changing tires. Some series enable very quick tire changes – others less so. The reasons behind these differences and how they are applied by race teams in tight, time competitive situations can teach us lessons about the haste we should or should NOT have when deploying software.

Why tire changes? The main reason is that, like deploying software, there are multiple potential points of change (4 tires on the car – software, data, systems, network with the software). And, in both situations, it is less important how fast you can change just one of them than how fast you change all of them. There is even the variants where you may not need to change all 4 tires (or system components) every time, but you must be precise in your changes.

Formula 1

Formula 1 is a fantastically expensive racing series and features extreme everything – including the fastest pit stops in the business. Sub 4-second stops are the norm, during which all 4 tires are changed. There are usually around18 people working on the car – 12 of whom are involved in getting the old tires off and clear while putting new tires on (not counting another 2 to work the jacks). That is a large team, with a lot of expensive people on it, who invest a LOT of expensive time practicing to ensure that they can get all 4 tires changed in a ridiculously short period of time. And they have to do it for two cars with potentially different tire use strategies, do it safely, while competing in a sport that measures advantage in thousandths of a second.

But, there is a reason for this extreme focus / investment in tire changes. The tire changes are the most complex piece of work being done on the car during a standard pit stop. Unlike other racing series, there is no refueling in Formula 1 – the cars must have the range to go the full race distance. In fact, the races are distance and time limited, so the components on the cars are simply engineered to go that distance without requiring service, and therefore time, during the race. There are not even windows to wash – it is an open cockpit car. So, the tires are THE critical labor happening during the pit stop and the teams invest accordingly.

Endurance (Le Mans)

In contrast to the hectic pace of a Formula 1 tire change is Endurance racing. These are cars that are built to take the abuse of racing for 24 hours straight. These cars require a lot of service over the course of that sort of race and the tires are therefore only one of several critical points that have to be serviced in the course of a race. Endurance racers have to be fueled, have brake components replaced, and the three drivers have to switch out periodically so they can rest. The rules of this series, in fact limit the number of tire wrenches the team can use in the pits to just one. That is done to discourage teams from cutting corners and also to keep team size (and therefore costs) down.

NASCAR

NASCAR is somewhere between Formula 1 and Endurance racing when it comes to tire changes. This series limits tire wrenches to two and tightly regulates the number of people working on the car during a pit stop. These cars require fuel, clean-up, and tires just like the Endurance cars, but generally do not require any additional maintenance during a race, barring damage. So, while changing tires quickly is important, there are other time eating activities going on as well.

Interestingly, in addition to safety considerations, NASCAR limits personnel to keep costs down to help the teams competing in the series afford the costs of doing so. That keeps the overall series competition healthy by ensuring a good number of participants and the ability of new teams to enter. Which, to contrast, is one of the problems that Formula 1 has had over the years.

In comparing the three approaches to the same activity, you see an emerging pattern where ultimate speed of changing tires gets traded based on cost and contextual criticality. These are the same trade-offs that are made in a business when it looks at how much faster it can perform a regular process such as deploy software. You could decide you want sub-four second tire changes, but that would be dumb if your business needs 10 seconds for refueling or several minutes for driver swaps and brake overhauls. And if they do, your four second tire change would look wasteful at best as your army of tire guys stands around and watches the guy fueling the car or the new driver adjusting his safety harnesses.

The message here is simple – understand what your business needs when it comes to deployment. Take the thrill of speed out of it and make an unemotional decision to optimize; knowing that optimal is contextually fastest without waste. Organizations that literally make their living from speed undestand this. You should consider this the next time you go looking to do something faster.

Rollback Addiction

A lot of teams are fascinated with the notion of ‘rollback’ in their environments.  They seem to be addicted to its seductive conceptual simplicity.  It does, of course, have valid uses, but, like a lot of things, it can become a self-destructive dependency if it is abused.  So, let’s take a look at some of the addictive properties of rollback and how to tell if you might have a problem.

Addictive property #1 – everybody is doing it.  One of the first things we learn in technology (Dev or Ops) is to keep a backup of a file when we change something.  That way, we know what we did in case it causes a problem.  In other words, our first one is free and was given to us by our earliest technology mentors.  As a result, everyone knows what it is.  It is familiar and socially acceptable.  We learn it while we are so professionally young that it becomes a “comfort food”.  The problem is that it is so pervasive that people do it without noticing.  They will revert things without giving it a second thought because it is a “good thing”.  And this behavior is a common reason rollback does not scale.  In a large system, where many people might be making changes, others might make changes based on your changes.  So, undoing yours without understanding others’ dependencies, means that you are breaking other things in an attempt to fix one thing.  If you are in a large environment and changes “backward” are not handled with the exact same rigor as changes forward, you might have a problem.

Addictive property #2 – it makes you feel good and safe.  The idea that “you are only as good as your last backup” is pretty pervasive.  So, the ability to roll something back to a ‘known good’ state gives you that warm, fuzzy feeling that it’s all OK.  Unfortunately, in large scale situations with any significant architectural complexity, it is probably not OK.  Some dependency is almost certainly unknown, overlooked or assumed to be handled manually.  That will lead to all sorts of “not OK” when you try to roll back.  If rollback is the default contingency plan for every change you make and you don’t systematically look at other options to make sure it is the right answer, you might have a problem.

Addictive property #3 – It is easy to sell.  Management does not understand the complexity of what is required to implement a set of changes, but they do understand “Undo”.  As a result, it is trivial to convince them that everything is handled and, if there should be a problem with a change, you can just ‘back it out’.  Being able to simplify the risks to an ‘undo’ type of concept can eliminate an important checkpoint from the process.  Management falls into the all to human behavior of assuming there is an ‘undo’ for everything and stops questioning the risk management plan because they think it is structurally covered.  This leads to all sorts of ugliness should there be a problem and the expectation of an easy back-out is not met.  Does your team deliberately check for oversights its contingency plan every time or does it assume that it will just ‘roll it back’?  If the latter, you might have a problem.

As usual, the fix for a lot of this is self-discipline.  The discipline to do things the hard and thorough way that takes just that little bit longer.  The discipline to institutionalize and reward healthy behaviors in your shop.  And, as usual, that goes against a fair bit of human nature that can be very difficult to overcome, indeed.