The “Other” Value Levers of Automation – Part 2 – Democratizing Expertise

The basic meaning of democratization (in a non-political sense) is to make something accessible to everyone. This is the core of so much software that is written today that it is highly ironic how it is rarely systematically applied to the process of actually producing software. However, it is this aspect of automation that is a key _reason_ why automation delivers such throughput benefits. By encapsulating complexity and expertise into something easily consumed, novices can perform tasks in which they are not expert and do so on demand. In other words, it makes the ‘scarce expertise’ bottleneck can be made irrelevant.

Scaled software environments are now far too complex and involve too many integrated technologies for there to be anyone who really understands all of the pieces at a detailed level. Large scale complexity naturally drives the process of specialization. This has been going on for ages in society at large and there are plenty of studies that describe how we could not really have cities if we did not parcel out all of the basic tasks of planning, running, and supplying the city to many specialists. No one can be an expert in power plants, water plants, sewage treatment plants, and all of the pumps, circuits, pipes, and pieces in them. So, we have specialists.

Specialists, however, create a natural bottleneck. Even in a large situation where you have many experts in something, the fact that the people on the scene are unable to take action means they are waiting and, people who depend on that group are waiting. A simple example is unstopping a clogged pipe. Not the world’s most complex issue, but it is a decent lens for illustrating the bottleneck factor. On one hand, if you don’t know anything about plumbing, then you have to call (and wait for) a plumber. Think of the time saved if a plumber was always right next to the drain and could jump right in and unclog that pipe.

Example problems, such as plumbing, that require physical fixes are much harder, of course. In the case of a scaled technology environment, we are fortunate to be able to work with much more malleable stuff – software and software-defined infrastructure. Before we get too excited by that, however, we should remember that, while our environments are far easier to automate than, say, a PVC pipe, we still face the knowledge and tools barrier. And the fact that technology organizations have a lot people waiting and depending on technology ‘plumbers’ is one of the core drivers of why the DevOps movement is so resonant in the first place.

Consider the situation where developers need an environment in which to build new features for an application system. If developers in that environment can click a button and have a fully operational, representative infrastructure for their application system provisioned and configured in minutes, it is because the knowledge of how to do that has been captured. That means that ever time a developer needs to refresh their environment, a big chunk of time is saved by not having to wait on the ‘plumber’ (expert). And that is before taking into account the fact that the removal of that dependency on the expert allows the developer to more frequently refresh their environment – which creates opportunities to enhance quality and productivity. And a similar value proposition exists for testers, demo environments, etc. Even if the automated process is no faster than the expert-driven approach it replaces, the removal of the wait time delivers a massive value proposition.

So, automating value lever number one is ‘power to the people’. Consider that when choosing what to automate first and how much to invest in automating that thing. It doesn’t matter how “cool” or “powerful” a concept is if it doesn’t help the masses in your organization. This should be self-evident, but you still hear people waxing on about how much faster things start in Docker. A few questions later and you figure out that they only actually start those based on one of their ops guys getting an open item through an IT ticketing system from 2003…

The “Other” Value Levers of Automation – Part 1 – Introduction

Much of the automation discussion in DevOps focuses on speed or, if a more enlightened conversation, throughput. That is unfortunate, because it values a narrow dimension of automation without considering what makes things actually faster. That narrowness means that underlying inefficiencies can be masked by raw power. Too often that results in a short-term win followed by stagnation and an inability to improve farther. Sometimes it even creates a win that crumbles after a short period and results in a net loss. All of which translate to a poor set of investment priorities. These next few posts will look at why automation works – the core, simple levers that make it valuable – in an attempt to help people frame their discussion, set their priorities, and make smart investments in automation as they move their DevOps initiatives forward.

To be fair, let’s acknowledge that there is a certain sexiness about speed that can distract from other important conversations. Speed is very marketable as we have seen that for ages in the sports car market. Big horsepower numbers and big top speeds always get big glowing headlines, but more often than not, the car that is the best combined package will do best around a track even if it does not have the highest top speed. As engines and power have become less of a distinguishing factor (there are a surprising number of 200+mph cars on the market now), people are figuring that out. Many top-flight performance cars have begun to talk about their ‘time around the Nurburgring’ (a legendary racetrack in Germany) as a more holistic performance measure rather than just focusing on ‘top speed’.

The concept of the complete package being more important is well understood in engineering-driven automobile racing and winning teams have always used non-speed factors to gain competitive advantage. For example, about 50 years ago, Colin Chapman the founder of Lotus cars and designer of World Championship winning racing cars, famously said that “Adding power makes you faster on the straights. Subtracting weight makes you faster everywhere”. In other words, raw speed and top speed are important, but, if you have to slow down to turn you are likely to be beaten. Currently, Audi dominates endurance racing by running diesel powered cars that have to make fewer pitstops than the competition. They compete on the principle that even the fastest racecar is very beatable when it is sitting still being fueled.

So, since balance is an important input in achieving the outcomes of speed and throughput, I think we should look at some of the balancing value levers of automation in a DevOps context. In short, a discussion of the more subtle reasons _why_ automation takes center stage when we seek to eliminate organizational silos and how to balance the various factors.

The three non-speed value levers of automation that we will look at across the next few posts are:
1 – Ability to ‘democratize’ expertise
2 – Ability to automate delegation
3 – Traceability

Ops Heroes are NOT Qualified to do Anything with Nothing

There is a certain “long-suffering and misunderstood” attitude that shows up a lot in Operations. I have seen this quote on a number of cube walls:

We the willing, 
led by the unknowing, 
are doing the impossible 
for the ungrateful. 

We have done so much, 
with so little, 
for so long, 
we are now qualified to do anything, 
with nothing. 

Note: This quote is often mistakenly attributed to Mother Teresa. It was actually from this other guy called Konstantin Josef Jireček that no one has heard of recently.

The problem, of course, is that this attitude is counter-productive in a DevOps world. It promotes the culture that operations will ‘get it done’ no matter what how much is thrown their way in terms of budget cuts, shortened timeframes, uptime expectations, etc. It is a great and validating thing in some ways – you pulled off the impossible and get praise heaped on you. It is really the root of defective ‘hero culture’ behaviors that show up in tech companies or tech departments. And no matter how many times we write about the defectiveness of hero culture in a sustained enterprise, the behavior persists due to a variety of larger societal attitudes.

If you have seen (or perpetuated) such a culture, do not feel too bad – aspects of it show up in other disciplines including medicine. There is a fascinating discussion of this – and the cultural resistance to changing the behaviors – in Atul Gawande’s book, The Checklist Manifesto. The book is one of my favorites of the last couple of years. It discusses the research Dr (yes – he is a surgeon himself) Gawande did on why the instance of complications after surgery was so high relative to other high-criticality activities. He chose aviation – which is a massively complex and yet very precise, life-critical industry. It also has a far better record of incident free activity relative to the more intimate and expertise-driven discipline of medicine. The book proceeds to look at the evolution of the cultures of both industries and how one developed a culture focused on the surgeon being omniscient and expert in all situations while the other created an institutional discipline that seeks to minimize human fallibility in tense situations.

He further looks into the incentives surgeons have – because they have a finite number of hours in the day – to crank through procedures as quickly as possible. That way they generate revenue and do not tie up scarce and expensive operating rooms. But surgeons really can only work so fast and procedures tend to take as long as they do for a given patient’s situation. Their profession is manual and primarily scales based on more people doing more work. Aviation exploits the fact that it deals with machines and has more potential for instrumentation and automation.

The analogy is not hard to make to IT Operations people having more and more things to administer in shorter downtime windows. IT Operations culture, unfortunately, has much more in common with medicine than it does with aviation. There are countless points in the book that you should think about the next time you are logged in with root or equivalent access and about to manually make a surgical change… What are you doing to avoid multitasking? What happens if you get distracted? What are you doing to leverage/create instrumentation – even something manual like a checklist – to ensure your success rate is better each time? What are you doing to ensure that what you are doing can be reproduced by the next person? It resonates…

The good news is that IT Operations as a discipline (despite its culture) deals with machines. That means it is MUCH easier to create tools and instrumentation that leverage expertise widely while at the same time improving the consistency with which tasks are performed. Even so, I have heard only a few folks mention it at DevOps events and that is unfortunate, because the basic discipline of just creating good checklists – and the book discusses how – is a powerful and immediately adoptable thing that any shop, regardless of platform, toolchain, or history can adopt and readily benefit from. It is less inspirational and visionary than The Phoenix Project,  but it is one of the most practical approaches of working toward that vision that exists.

The book is worth a read – no matter how DevOps-y your environment is or wants to be. I routinely recommend it to our junior team members as a way to help them learn to develop sustainable disciplines and habits. I have found this to be a powerful tool for managing overseas teams, too.

I would be interested in anyone’s feedback who is using checklist techniques – particularly as an enhancement / discipline roadmap in a DevOps shop. I have had some success wrapping automation and instrumentation (as well as figuring out how to prioritize where to add automation and instrumentation) by building checklists for things and would love to talk about it with others who are experimenting with it.

What Makes a Good DevOps Tool?

We had an interesting discussion the other day about what made a “good” DevOps tool.  The assertion is that a good citizen or good “link” in the toolchain has the same basic attributes regardless of the part of the system for which it is responsible. As it turns out, at least with current best practices, this is a reasonably true assertion.  We came up with three basic attributes that the tool had to fit or it would tend to fall out of the toolchain relatively quickly. We got academic and threw ‘popular’ out as a criteria – though supportability and skills availability has to be a factor at some point in the real world. Even so, most popular tools are at least reasonably good in our three categories.

Here is how we ended up breaking it down:

  1. The tool itself must be useful for the domain experts whose area it affects.  Whether it be sysadmins worried about configuring OS images automatically, DBAs, network guys, testers, developers or any of the other potential participants, if the tool does not work for them, they will not adopt it.  In practice, specialists will put up with a certain amount of friction if it helps other parts of the team, but once that line is crossed, they will do what they need to do.  Even among development teams, where automation is common for CI processes, I STILL see shops where they have a source control system that they use day-to-day and then promote from that into the source control system of record.  THe latter was only still in the toolchain due to a bureaucratic audit requirement.
  2. The artifacts the tool produces must be easily versioned.  Most often, this takes the form of some form of text-based file that can be easily managed using standard source control practices. That enables them to be quickly referenced and changes among versions tracked over time. Closed systems that have binary version tracking buried somewhere internally are flat-out harder to manage and too often have layers of difficulty associated with comparing versions and other common tasks. Not that it would have to be a text-based artifact per se, but we had a really hard time coming up with tools that produced easily versioned artifacts that did not just use good old text.
  3. The tool itself must be easy to automate externally.  Whether through a simple API or command line, the tool must be easily inserted into the toolchain or even moved around within the toolchain with a minimum of effort. This allows quickest time to value, of course, but it also means that the overall flow can be more easily optimized or applied in new environments with a minimum of fuss.

We got pretty meta, but these three aspects showed up for a wide variety of tools that we knew and loved. The best build tools, the best sysadmin tools, even stuff for databases had these aspects. Sure, this is proof positive that the idea of ‘infrastructure as code’ is still very valid. The above apply nicely to the most basic of modern IDEs producing source code. But the exercise became interesting when we looked at older versus newer tools – particularly the frameworks – and how they approached the problem. Interestingly we felt that some older, but popular, tools did not necessarily pass the test.  For example, Hudson/Jenkins are weak on #2 and #3 above.  Given their position in the toolchain, it was not clear if it mattered as much or if there was a better alternative, but it was an interesting perspective on what we all regarded as among the best in their space.

This is still an early thought, but I thought I would share the thought to see what discussion it would stimulate. How we look at tools and toolchains is evolving and maturing. A tool that is well loved by a particular discipline but is a poor toolchain citizen may not be the right answer for the overall organization. A close second that is a better overall fit might be a better answer. But, that goes against the common practice of letting the practitioners use what they feel best for their task. What do you do? Who owns that organizational strategic call? We are all going to have to figure that one out as we progress.

DevOps is about Building Fords, not Ferraris

There is an interesting obsession with having the ‘ultimate’ of whatever you’re talking about. This applies to most things in our society: jobs, houses, televisions, cars. You name it, there is an ‘ultimate’ version that everyone aspires to have. There is a lot of good to this behavior, to be sure. I believe strongly that everyone should be trying to get better all the time. Though I would point out that it is healthier to regard the ultimate [whatever] as a consequence or benefit of getting better rather than an end unto itself.

But it’s usually bad to want the ‘ultimate’ in your software delivery process. Goldplating has always been an enemy in software projects and there is evidence of it in how a lot of organizations have traditionally delivered software. It usually shows up in the culture, where high-intervention processes lead to hero cults and aspirations to be the ultimate ‘hero’ who gets releases out the door. Old-school, old-world hand craftsmanship is the order of the day. DevOps is the exact opposite of this approach. It focuses on a highly repeatable, scalable, and mass-produced approach to releasing software. And frequently.

Which brings me back to the contrast between a Ferrari and a Ford. A Ferrari is pretty much the ultimate sports car and ultimate sports car brand. There really is very little not to like. But the cars are exotics still built with expensive materials using manual, old world techniques. To be fair, Ferrari has a super-modern robotic process for a lot of their precision work, but they add a lot of customization and hand-finishing. And they ship a very few thousand releases (cars) each year. Sustaining such a car in the real world involves specially trained mechanics named Giuseppe, long waits for parts from Italy, and even shipping the car across the state if you don’t live close to a qualified shop. No biggie – if you can afford the car, you can afford the maintenance. But, let’s face it, they are a ‘money is no object’ accessory.

Ford has shipped a variety of performance models over the years based on the Mustang platform. In fact, there have been years where Ford has shipped more performance Mustangs in a week than Ferrari would ship cars in that YEAR. And there is a magic there for a DevOps geek. Plain ol’ Ford Motor Company has started selling a 200mph Mustang this year for about $60K. There’s nothing too exotic about it. You can go to your local Ford dealer and buy it. It can be purchased at one dealer and serviced at any other dealer anywhere in the country. Parts? No problem – most of them are in local warehouses stationed strategically so that no dealer would have to keep a customer waiting too long for common items. A lot of stuff can be had from your local AutoZone because, well, it’s “just” a Mustang.

The lesson, though, is that Ford has an economy of scale by virtue of the volume of Mustangs it produces. No, a Mustang is not as nice or as custom as a Ferrari. It is as common and mass-produced as anything. But a 200mph car that anyone can buy for noticeably less than a house, get parts easily, and have serviced at thousands of locations is an amazing and magical thing. It teaches a solid lesson about scalability and sustainability that should be inspirational for DevOps teams.

And maybe, just maybe, if your company does a good enough job at sustainably delivering your software, you might be able to afford that Ferrari someday…

PS – for Chevy zealots. I realize the Corvette cleared 200 on a “volume” platform first. But the 200mph Plastic Fantastic looks more exotic relative to the Mustang – which has a plain “sporty commuter” or even rental fleet version with a V6. And the common example of the economies of scale mean that the 200mph Shelby Mustang is still a bargain relative to the 200mph capable ‘vette, which is the point of this post.

DevOps DARPA Style

I think spending a lot of time on DevOps may skew my interpretation of different trends and articles.  To me it seems that everyone is trying to reinvent and “lean out” there design to engineering flow to be faster, more iterative, and generally more responsive to conditions in our rapidly changing world.  Faster is, of course, relative depending on what you are talking about.  I recently saw this article on the MIT Technology Review about DARPA (always a source of cool advanced engineering ideas) undertaking a rapid approach for getting a new tank designed and built.

Article here:  http://www.technologyreview.com/news/509311/darpa-wants-to-remake-manufacturing/

The article thematically addresses concepts like ensuring a common understanding of the design among contributing engineers and moving manufacturing knowledge closer to the design stage so it is actually a part of the design thinking.  My DevOps skew made the immediate association of how similar this was to the collaboration implicit with Agile and DevOps.  Everyone needs to know the architecture and Ops needs to be involved directly with development while development is underway to ensure rapid Continuous Delivery cycles.  It’s a good perspective on how applicable these concepts are on a much broader scale and in varied industries.

I figure that if these guys can do it with metal in the context of a tank, it has to be possible with whatever software or virtualization problem I”m dealing with.  Though it does make me want better toys for our office.  I have to believe that DARPA has cooler Nerf guns…

Slides from Agile Austin Talk 2-12-2013

This post slides from my talk at Agile Austin on February 12, 2013.

I want to thank the group for the opportunity and the audience for the great interaction!

Link to slides:  DevOps Beyond the Basics – FINAL

A System for Changing Systems – Part 9

The last capability area in the framework is that of Monitoring. I saved this for last because it is the one that tends to be the most difficult to get right. Of course, commensurate with the difficulty is the benefit gained when it is working properly. A lot of the difficulty and benefit with Monitoring comes from the fact that knowing what to look at, when to look at it and what NOT to look at are only the first steps. It also becomes important to know what distributed tidbits of information to bring together if you actually want a complete picture of your application environment.

Monitoring

Monitoring Capability Area

This post could go for pages – and Monitoring is likely going to be a consuming topic as this series progresses, but for the sake of introduction, lets look at the Monitoring capability area. The sub-capabilities for this area encompass the traditional basics of monitoring Events and Trends among them. The challenge for these two is in figuring out which Events to monitor and sometimes how to get the Event data in the first place. The Trends must then be put into a Report format that resonates with management. It is important to invest in this area in order to build trust with management that the team has control as it tries to increase the frequency of changes – without management’s buy-in, they won’t fund the effort. Finally, the Correlation sub-capability area is related to learning about the application system’s behavior and how changes to some part of the system impacts the other parts. This is an observational knowledge base that must be deliberately built by the team over time so that they can put the Events, Trends, and Reports into the most useful contexts and use the information to better understand risks and priorities when making changes to the system.

A System for Changing Systems – Part 8

The fourth capability area is that of Provisioning. It covers the group of activities for creating all or part of an environment in which an application system can run. This is a key capability for ensuring that application systems have the capacity they need to maintain performance and availability. It is also crucial for ensuring that development and test activities have the capacity they need to maintain THEIR performance. The variance with test teams is that a strong Provisioning capability also ensures that development and test teams can have clean dev/test environments that are very representative of prorduction environments and can very quickly refresh those dev/test environments as needed. The sub-capabilities here deal with managing the consistency of envionment configurations, and then quickly building environments to a known state.

Provisioning Capability Area

Provisioning Capability Area

The fifth capability area is closely related to Provisioning. It is the notion of a System Registry capability. This set of capabilities deals with delivering the assumed infrastructure functions (e.g. DNS, e-mail relays, IP ranges, LDAP, etc.) that surround the environments. These capabilities must be managed in such a way that one or more changes to an application system can be added to a new or existing environment with out significant effort or disruption. In many ways this capability area is the fabric in which the others operate. It can also be tricky to get right because this capability area often spans multiple application systems.

System Registry Capability Area

System Registry Capability Area

A System for Changing Systems – Part 7 – Deployment Capabilities

The third capability area is that of Deployment. Deployment deals with the act of actually putting the changes into a given target environment. It is not prescritive of how this happens. Many shops mechanically deal with deployment via their provisioning system. That is obviously a good thing and an efficiency gain by removing a discrete system for performing deployment activities. It is really a best practice of the most mature organizations. However, this taxonomy model is about identifying the capabilities needed to consistently apply changes to a whole application system. And, lets face it, best practices tend to be transient; as new, even better, best practices emerge.

Deployment Capability Area

Deployment Capability Area

Additionally, there are a number of reasons the capability is included in this taxonomy. First of all, the framework is about capabilities rather than technologies or implementations. It is important to be deliberate about how changes are deployed to all environments and simply because some group of those changes are handled by a provisioning tool does not remove the fact that not all are covered nor does it remove the fact that some deliberate work is expended in fitting the changes into the provisioning tool’s structure. Most provisioning tools, for example are set up to handle standard package mechanisms such as RPM. The deployment activity in that scenario is more one of packaging the custom changes. But the provisioning answer is not necessarily a solution for all four core areas of an applpication system, so there needs to be a capability that deals generically with all of them. Finally, many, if not most, shops have some number of systems where there are legacy technical requirements that require deployment to happen separately.

All of that being true, the term “Deployment” is probably confusing given its history and popular use. It will likely be replaced in the third revision of this taxonomy with something more generic, such as “Change Delivery”.

The sub-category of Asset Repository refers to the fact that there needs to be an ability to maintain a collection of changes that can be applied singly or in bulk to a given application system. In the third revision of the taxonomy, it is likely to be joined by a Packaging sub-capability.  Comments and thoughts are welcome as this taxonomy is evolving and maturing along with the DevOps movement.