DevOps is about Building Fords, not Ferraris

There is an interesting obsession with having the ‘ultimate’ of whatever you’re talking about. This applies to most things in our society: jobs, houses, televisions, cars. You name it, there is an ‘ultimate’ version that everyone aspires to have. There is a lot of good to this behavior, to be sure. I believe strongly that everyone should be trying to get better all the time. Though I would point out that it is healthier to regard the ultimate [whatever] as a consequence or benefit of getting better rather than an end unto itself.

But it’s usually bad to want the ‘ultimate’ in your software delivery process. Gold-plating has always been an enemy in software projects and there is evidence of it in how a lot of organizations have traditionally delivered software. It usually shows up in the culture, where high-intervention processes lead to hero cults and aspirations to be the ultimate ‘hero’ who gets releases out the door. Old-school, old-world hand craftsmanship is the order of the day. DevOps is the exact opposite of this approach. It focuses on a highly repeatable, scalable, mass-produced approach to releasing software, and on doing so frequently.

Which brings me back to the contrast between a Ferrari and a Ford. A Ferrari is pretty much the ultimate sports car and ultimate sports car brand. There really is very little not to like. But the cars are exotics still built with expensive materials using manual, old world techniques. To be fair, Ferrari has a super-modern robotic process for a lot of their precision work, but they add a lot of customization and hand-finishing. And they ship a very few thousand releases (cars) each year. Sustaining such a car in the real world involves specially trained mechanics named Giuseppe, long waits for parts from Italy, and even shipping the car across the state if you don’t live close to a qualified shop. No biggie – if you can afford the car, you can afford the maintenance. But, let’s face it, they are a ‘money is no object’ accessory.

Ford has shipped a variety of performance models over the years based on the Mustang platform. In fact, there have been years where Ford has shipped more performance Mustangs in a week than Ferrari would ship cars in that YEAR. And there is a magic there for a DevOps geek. Plain ol’ Ford Motor Company has started selling a 200mph Mustang this year for about $60K. There’s nothing too exotic about it. You can go to your local Ford dealer and buy it. It can be purchased at one dealer and serviced at any other dealer anywhere in the country. Parts? No problem – most of them are in local warehouses stationed strategically so that no dealer would have to keep a customer waiting too long for common items. A lot of stuff can be had from your local AutoZone because, well, it’s “just” a Mustang.

The lesson, though, is that Ford has an economy of scale by virtue of the volume of Mustangs it produces. No, a Mustang is not as nice or as custom as a Ferrari. It is as common and mass-produced as anything. But a 200mph car that anyone can buy for noticeably less than a house, get parts easily, and have serviced at thousands of locations is an amazing and magical thing. It teaches a solid lesson about scalability and sustainability that should be inspirational for DevOps teams.

And maybe, just maybe, if your company does a good enough job at sustainably delivering your software, you might be able to afford that Ferrari someday…

PS – for Chevy zealots. I realize the Corvette cleared 200 on a “volume” platform first. But the 200mph Plastic Fantastic looks more exotic relative to the Mustang – which has a plain “sporty commuter” or even rental fleet version with a V6. And those same economies of scale mean that the 200mph Shelby Mustang is still a bargain relative to the 200mph-capable ‘vette, which is the point of this post.

A System for Changing Systems – Part 9

The last capability area in the framework is that of Monitoring. I saved this for last because it is the one that tends to be the most difficult to get right. Of course, commensurate with the difficulty is the benefit gained when it is working properly. A lot of the difficulty and benefit with Monitoring comes from the fact that knowing what to look at, when to look at it and what NOT to look at are only the first steps. It also becomes important to know what distributed tidbits of information to bring together if you actually want a complete picture of your application environment.

Monitoring Capability Area

This post could go on for pages – and Monitoring is likely going to be a consuming topic as this series progresses – but for the sake of introduction, let’s look at the Monitoring capability area. The sub-capabilities for this area include the traditional basics of monitoring, Events and Trends among them. The challenge for these two is in figuring out which Events to monitor and sometimes how to get the Event data in the first place. The Trends must then be put into a Report format that resonates with management. It is important to invest in this area in order to build trust with management that the team has control as it tries to increase the frequency of changes; without management’s buy-in, the effort won’t get funded. Finally, the Correlation sub-capability is about learning the application system’s behavior and how changes to one part of the system impact the other parts. This is an observational knowledge base that must be deliberately built by the team over time so that they can put the Events, Trends, and Reports into the most useful contexts and use the information to better understand risks and priorities when making changes to the system.
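As a rough illustration of the Correlation idea, the sketch below pairs each change with the Events seen shortly after it was applied. The function name, the record shapes, the 30-minute window, and the sample data are all assumptions made up for this example; time proximity only suggests candidates for the team's knowledge base and does not prove that a change caused an Event.

```python
from datetime import datetime, timedelta

def events_after_changes(changes, events, window=timedelta(minutes=30)):
    """Map each change id to the events seen within `window` after it was applied.

    `changes` is a list of (change_id, applied_at) tuples and `events` is a
    list of (event_name, seen_at) tuples. Both shapes are hypothetical.
    """
    candidates = {}
    for change_id, applied_at in changes:
        candidates[change_id] = [
            name
            for name, seen_at in events
            if applied_at <= seen_at <= applied_at + window
        ]
    return candidates

# Made-up sample data for illustration only.
changes = [
    ("CHG-1", datetime(2024, 1, 1, 10, 0)),
    ("CHG-2", datetime(2024, 1, 1, 14, 0)),
]
events = [
    ("db latency spike", datetime(2024, 1, 1, 10, 12)),
    ("disk warning", datetime(2024, 1, 1, 12, 30)),
]

print(events_after_changes(changes, events))
```

A team would still need to review each candidate pairing by hand before recording it as a known relationship between parts of the system.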

A System for Changing Systems – Part 8

The fourth capability area is that of Provisioning. It covers the group of activities for creating all or part of an environment in which an application system can run. This is a key capability for ensuring that application systems have the capacity they need to maintain performance and availability. It is also crucial for ensuring that development and test activities have the capacity they need to maintain THEIR performance. The difference for development and test teams is that a strong Provisioning capability also ensures that they can have clean dev/test environments that are very representative of production environments and can very quickly refresh those environments as needed. The sub-capabilities here deal with managing the consistency of environment configurations, and then quickly building environments to a known state.

Provisioning Capability Area

The fifth capability area is closely related to Provisioning. It is the notion of a System Registry capability. This set of capabilities deals with delivering the assumed infrastructure functions (e.g. DNS, e-mail relays, IP ranges, LDAP, etc.) that surround the environments. These capabilities must be managed in such a way that one or more changes to an application system can be added to a new or existing environment without significant effort or disruption. In many ways this capability area is the fabric in which the others operate. It can also be tricky to get right because this capability area often spans multiple application systems.

System Registry Capability Area

A System for Changing Systems – Part 7 – Deployment Capabilities

The third capability area is that of Deployment. Deployment deals with the act of actually putting the changes into a given target environment. It is not prescriptive of how this happens. Many shops mechanically deal with deployment via their provisioning system. That is obviously a good thing and an efficiency gain, since it removes the need for a discrete system for performing deployment activities. It is really a best practice of the most mature organizations. However, this taxonomy model is about identifying the capabilities needed to consistently apply changes to a whole application system. And, let’s face it, best practices tend to be transient as new, even better, best practices emerge.

Deployment Capability Area

Additionally, there are a number of reasons the capability is included in this taxonomy. First of all, the framework is about capabilities rather than technologies or implementations. It is important to be deliberate about how changes are deployed to all environments. Simply because some group of those changes is handled by a provisioning tool does not remove the fact that not all are covered, nor does it remove the fact that some deliberate work is expended in fitting the changes into the provisioning tool’s structure. Most provisioning tools, for example, are set up to handle standard package mechanisms such as RPM. The deployment activity in that scenario is more one of packaging the custom changes. But the provisioning answer is not necessarily a solution for all four core areas of an application system, so there needs to be a capability that deals generically with all of them. Finally, many, if not most, shops have some number of systems where there are legacy technical requirements that require deployment to happen separately.

All of that being true, the term “Deployment” is probably confusing given its history and popular use. It will likely be replaced in the third revision of this taxonomy with something more generic, such as “Change Delivery”.

The sub-category of Asset Repository refers to the fact that there needs to be an ability to maintain a collection of changes that can be applied singly or in bulk to a given application system. In the third revision of the taxonomy, it is likely to be joined by a Packaging sub-capability. Comments and thoughts are welcome as this taxonomy is evolving and maturing along with the DevOps movement.

A System for Changing Systems – Part 6 – Change Management and Orchestration Capabilities

This post covers the first two capability areas in the system taxonomy. This discussion will begin with where the changes come into the “system for changing systems”, Change Management, and proceed around the picture of top-level capability areas.

The first capability area to look at is Change Management. Change is the fundamental reason for this discussion and, in many ways, the discussion is pointless unless this capability is well understood. Put more simply, you cannot apply changes if you do not know what the changes are. As a result, this capability area is the change injector for the system. It is where changes to the four components of the application system are identified, labeled and tracked as they are put into place in each environment. For convenience and in recognition of the fact that changes are injected from both the “new feature” angle as well as from the “maintenance item” angle, the two sources of change are each given their own capability sub-area.

Change Management Capability Area

The second capability area is that of Orchestration. In a complex system that is maintained by a combination of human and machine-automated processes, understanding what is done, by whom, and in what order is important. This capability area has two sub-areas – one for the technical side and one for the people. This reflects the need to keep the technical dependencies properly managed and also to keep everyone on the same page. Orchestration is a logical extension of the changes themselves. Once you know what the changes are, everyone and everything must stay synchronized on when and where those changes are applied to the application system.

Orchestration Capability Area

A System for Changing Systems – Part 5 – Top-level Categories

The first step to understanding the framework is to define the broad, top-level capability areas. A very common problem in technology is the frequent over-use of terms that can have radically different meanings depending on the context of a conversation. So, as with any effort to clarify the discussion of a topic, it is critical to define terms and hold to those definitions during the course of the discussion.

Top level categories of capabilities around various environments in which applications typically must run.

Top level capability areas for sustaining application systems across environments.

At the top level of this framework are six capability groupings:

  • Change Management – This category is for capabilities that ensure that changes to the system are properly understood and tracked as they happen. This is a massively overused term, but the main idea for this framework is that managing changes is not the same thing as applying them. Other capabilities deal with that. This capability category is all about oversight.
  • Orchestration – This category deals with the ability to coordinate activity across different components, areas, and technologies in a complex distributed application system in a synchronized manner.
  • Deployment – This category covers the activities related to managing the lifecycles of an application system’s artifacts through the various environments. Put more simply, this area deals with the mechanics of actually changing out pieces of an application system.
  • Monitoring – The monitoring category deals with instrumenting the environment for various purposes. This instrumentation concept covers all pieces of the application system and provides feedback in the appropriate manner for interested stakeholders. For example, capacity usage for operations and feature usage for development.
  • System Registry – This refers to the need for a flexible and well-understood repository of shared information about the infrastructure in which the application system runs. This deals with the services on which the application system depends and which may need to be updated before a new instance of the application system can operate correctly.
  • Provisioning – This capability is about creating and allocating the appropriate infrastructure resources for an instance of the application system to run properly. This deals with the number and configuration of those resources. While this area is related to deployment, it is separate because in many infrastructures it may not be desirable or even technically possible to provision fresh resources with each deployment, and linking the two would blunt the relevancy of the framework.
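One way to keep these definitions fixed during a discussion is to write the six areas down as a small data structure and check a team's coverage against it. The following is a minimal sketch, not part of the framework itself: the `CapabilityArea` and `uncovered` names and the sample coverage set are assumptions invented for illustration.

```python
from enum import Enum

class CapabilityArea(Enum):
    """The six top-level capability areas listed above."""
    CHANGE_MANAGEMENT = "Change Management"
    ORCHESTRATION = "Orchestration"
    DEPLOYMENT = "Deployment"
    MONITORING = "Monitoring"
    SYSTEM_REGISTRY = "System Registry"
    PROVISIONING = "Provisioning"

def uncovered(covered):
    """Return the names of areas that are not in the `covered` set."""
    return [area.value for area in CapabilityArea if area not in covered]

# Hypothetical team that has only invested in deployment and provisioning.
team_coverage = {CapabilityArea.DEPLOYMENT, CapabilityArea.PROVISIONING}
print(uncovered(team_coverage))
# ['Change Management', 'Orchestration', 'Monitoring', 'System Registry']
```

Using an enumeration rather than free-form strings means a conversation about "deployment" or "change management" always points at one agreed definition.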

The next few posts will dig into the sub-categories underneath each of these top-level items.

How Fast Should You Change the Tires?

I am an unabashed car nut and like to watch a variety of motor racing series. In particular I tend to stay focused on Formula 1 with a secondary interest in the endurance series (e.g. Le Mans). In watching several races recently, I observed that the differences in how each series managed tire changes during pit stops carried some interesting analogies to deploying software quickly.

Each racing series has a different set of rules and limitations with regard to how pit stops may be conducted. These rules are imposed for a combination of safety reasons, competitive factors, and the overall viability of the racing series. There are even rules about changing tires. Some series enable very quick tire changes – others less so. The reasons behind these differences and how they are applied by race teams in tight, time competitive situations can teach us lessons about the haste we should or should NOT have when deploying software.

Why tire changes? The main reason is that, like deploying software, there are multiple potential points of change (4 tires on the car; software, data, systems, and network in an application system). And, in both situations, it is less important how fast you can change just one of them than how fast you change all of them. There are even variants where you may not need to change all 4 tires (or system components) every time, but you must be precise in your changes.

Formula 1

Formula 1 is a fantastically expensive racing series and features extreme everything – including the fastest pit stops in the business. Sub 4-second stops are the norm, during which all 4 tires are changed. There are usually around 18 people working on the car – 12 of whom are involved in getting the old tires off and clear while putting new tires on (not counting another 2 to work the jacks). That is a large team, with a lot of expensive people on it, who invest a LOT of expensive time practicing to ensure that they can get all 4 tires changed in a ridiculously short period of time. And they have to do it for two cars with potentially different tire use strategies, do it safely, while competing in a sport that measures advantage in thousandths of a second.

But, there is a reason for this extreme focus / investment in tire changes. The tire changes are the most complex piece of work being done on the car during a standard pit stop. Unlike other racing series, there is no refueling in Formula 1 – the cars must have the range to go the full race distance. In fact, the races are distance and time limited, so the components on the cars are simply engineered to go that distance without requiring service, and therefore time, during the race. There are not even windows to wash – it is an open cockpit car. So, the tires are THE critical labor happening during the pit stop and the teams invest accordingly.

Endurance (Le Mans)

In contrast to the hectic pace of a Formula 1 tire change is Endurance racing. These are cars that are built to take the abuse of racing for 24 hours straight. These cars require a lot of service over the course of that sort of race and the tires are therefore only one of several critical points that have to be serviced in the course of a race. Endurance racers have to be fueled, have brake components replaced, and the three drivers have to switch out periodically so they can rest. The rules of this series, in fact, limit the number of tire wrenches the team can use in the pits to just one. That is done to discourage teams from cutting corners and also to keep team size (and therefore costs) down.

NASCAR

NASCAR is somewhere between Formula 1 and Endurance racing when it comes to tire changes. This series limits tire wrenches to two and tightly regulates the number of people working on the car during a pit stop. These cars require fuel, clean-up, and tires just like the Endurance cars, but generally do not require any additional maintenance during a race, barring damage. So, while changing tires quickly is important, there are other time eating activities going on as well.

Interestingly, in addition to safety considerations, NASCAR limits personnel to keep costs down to help the teams competing in the series afford the costs of doing so. That keeps the overall series competition healthy by ensuring a good number of participants and the ability of new teams to enter. Which, to contrast, is one of the problems that Formula 1 has had over the years.

In comparing the three approaches to the same activity, you see an emerging pattern where ultimate speed of changing tires gets traded based on cost and contextual criticality. These are the same trade-offs that are made in a business when it looks at how much faster it can perform a regular process such as deploying software. You could decide you want sub-four second tire changes, but that would be dumb if your business needs 10 seconds for refueling or several minutes for driver swaps and brake overhauls. And if it does, your four second tire change would look wasteful at best as your army of tire guys stands around and watches the guy fueling the car or the new driver adjusting his safety harnesses.
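The arithmetic behind that trade-off can be sketched in a few lines. This is a simplified model that assumes all pit activities run side by side, so the stop lasts as long as the slowest one; the function names are invented for the example, and the durations are taken loosely from the 4-second and 10-second figures above.

```python
def stop_duration(tasks):
    """Length of the stop in seconds, assuming all tasks run in parallel."""
    return max(tasks.values())

def idle_time(tasks):
    """Seconds each crew spends waiting for the slowest task to finish."""
    total = stop_duration(tasks)
    return {name: total - seconds for name, seconds in tasks.items()}

# Illustrative durations in seconds, not measured data.
stop = {"tires": 4.0, "refuel": 10.0}

print(stop_duration(stop))  # 10.0
print(idle_time(stop))      # {'tires': 6.0, 'refuel': 0.0}
```

Under this assumption, cutting the tire change from 4 seconds to 2 would leave the stop at 10 seconds and only add idle time for the tire crew. The same reasoning applies to speeding up one step of a deployment that is not the bottleneck.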

The message here is simple – understand what your business needs when it comes to deployment. Take the thrill of speed out of it and make an unemotional decision to optimize, knowing that optimal is contextually fastest without waste. Organizations that literally make their living from speed understand this. You should consider this the next time you go looking to do something faster.