Your Deployment Doc Might Not be Useful for DevOps

One of the most common mistakes I see people making with automation is the assumption that they can simply wrap scripts around what they are doing today and be ‘automated’. The assumption is based around some phenomenally detailed runbook or ‘deployment document’ that has every command that must be executed. In ‘perfect’ sequence. And usually in a nice bold font. It was what they used for their last quarterly release – you know, the one two months ago? It is also serving as the template for their next quarterly release…

It’s not that these documents are bad or not useful. They are actually great guideposts and starting points for deriving a good automated solution to releasing software in their environment. However, you have to remember that these are the same documents that are used to guide late night, all hands, ‘war room’ deployments. The idea that their documented procedures are repeatablly automate-able is suspect, at best, based on that observation alone.

Deployment documents break down as an automate-able template for a number of reasons. First, there are almost always some number of undocumented assumptions about the state of the environment before a release starts. Second, using the last one does not account for procedural, parameter, or other changes between the prior and the upcomming releases. Third, the processes usually unconsciously rely on interpretation or tribal knowledge on the part of the person executing the steps. Finally, there is the problem that steps that make sense in a sequential, manual process will not take advantage of the intrinsic benefits of automation, such as parallel execution, elimination of data entry tasks, and so on.

The solution is to never set the expectation – particularly to those with organizational power – that the document is only a starting point. Build the automation iteratively and schedule multiple iterations at the start of the effort. This can be a great way to introduce Agile practices into the traditionally waterfall approaches used in operations-centric environments. This approach allows for the effort that will be required to fill in gaps in the document’s approach, negotiate standard packaging and tracking of deploy-able artifacts, add environment ‘config drift’ checks, or any of the other common ‘pitfall’ items that require more structure in an automated context.

This article is also on LinkedIn here: https://www.linkedin.com/pulse/your-deployment-doc-might-useful-devops-dan-zentgraf

Advertisements

Automation and the Definition of Done

“Done” is one of the more powerful concepts in human endeavor. Knowing that something is “done“ enables us to move on to the next endeavor, allows us to claim compensation, and sends a signal to others that they can begin working with whatever we have produced. However, assessing done can be contentious – particularly where the criteria are undefined. This is why the ‘definition of done’ is a major topic in software delivery. Software development has a creative component that can lead to confusion and even conflict. It is not a trivial matter.

Automation in the software delivery process forces the team to create a clear set of completion criteria early in the effort, thus reducing uncertainty around when things are ‘done’ as well as what happens next. Though they at first appear to be opposites, with done defining a stopping point and automation being much more about motion, the link between ‘done’ and automation is synergistic. Being good at one makes the team better at the other and vice-versa. Being good at both accelerates and improves the team’s overall capability to deliver software.

A more obvious example of the power of “done” appears in the Agile community. For example, Agile teams often have a doctrine of ‘test driven development’ where developers should write the tests first. Further examples include the procedural concepts for completing the User Stories in each iteration, or sprint, so that the team can clearly assess completion in the retrospective. Independent of these examples, validation-centric scenarios are an obvious area where automation can help underpin “done”. In the ‘test-driven development’ example, test suites that run at various points provide unambiguous feedback over whether more work is required. Those test suites become part of the Continuous Integration (CI) job so that every time a developer commits new code. If those pass, then the build automatically deploys into the integration environment for further evaluation.

Looking a bit deeper at the simple process of automatically testing CI builds reveals how automation forces a more mature understanding of “done”. Framed another way, the fact that the team has decided to have that automated assessment means that they have implicitly agreed to a set of specific criteria for assessing their ‘done-ness’. That is a major step for any group and evidence of significant maturation of the overall team.

That step of maturation is crucial, as it enables better flows across the entire lifecycle. For example, understanding how to map ‘done-ness’ into automated assessment is what enables advanced delivery methodologies such as Continuous Delivery. Realistically, any self-service process, whether triggered deliberately by a button push or autonomously by an event, such as delivering code, cannot exist without a clear, easily communicated, understanding of when that process is complete and how successful it was. No one would trust the automation were it otherwise.

There is an intrinsic link between “Done” and automation. They are mutual enablers. Done is made clearer, easier and faster by automation. Automation, in turn, forces a clear definition of what it means to be complete, or ‘done’. The better the software delivery team is at one, the stronger that team is at the other.

 

This article is also on LinkedIn here: https://www.linkedin.com/pulse/automation-definition-done-dan-zentgraf

Lean Software Development and Automation

Over the past decade or more, the interest in Agile and Lean topics has grown substantially as businesses have come to see software as a key part of their brand, their customer experience, and their competitive advantage. Understanding the principles at a basic level is relatively straightforward – a testament to how well they are articulated and thought out – however, executing on them can be difficult. One of the key tools in successfully executing around the vision of Lean is exploiting the power of automation. A frequent source of confusion is that automation itself is not a goal – rather, it is a very powerful means to achieving the goal. To clarify the point of automation helper rather than end, this post will look at the Principles of Lean Software Development as defined by Tom and Mary Poppendieck in their seminal work “Lean Software Development: An Agile Toolkit” (2003) and some ways automation enables them.

Principles of Lean Software Development

1- Principle: Eliminate waste

Tracing its way all the way back to the core of Lean manufacturing, this principle is about eliminating unnecessary work, such as fixes, infrequently used features, low-priority requirements, etc. that does not translate to actual customer value. In many ways, this principle underpins all of the others as guiding context for why they are valuable and important in their own right.

Automation carries both direct and indirect value for eliminating waste. In direct terms, it simply cuts cycle times by doing things faster relative to doing them manually. The indirect value is that speed enables the shorter feedback loops and the amplified learning that allow the team to make better decisions faster. Between the direct and indirect value, it is easy to see why there is so much focus on automation among the Lean, Agile and DevOps movements – it is at the core of waste elimination.

2 – Principle: Build Quality In

The notion of ‘building quality in’ deals with the point that it is fundamentally more efficient and cost-effective to build good code from the beginning than to try to ‘test quality in’ later. Testing late in the cycle, even though seen as a norm for years, is actually devastating to software projects. For starters, the churn of constantly fixing things is waste. Further, the chances of introducing all new problems (and thus extending expensive test cycles) increases with each cycle. Finally, the impact to schedules and feature work can be very damaging to customer value. There are numerous other problems as well.

The developers building the system need to have the proper facilities for ensuring that the code they have written actually meets the standard. Otherwise, they are relying on downstream tests to put the quality in and have thus violated this principle. Since the developers are human and their manual work, including any manual testing they might do, is therefore relatively slow error prone, automation is the only practical answer for ensuring they can validate their work while they are still working on it. This takes the form of techniques like Continuous Integration, automated tests during build cycles, and automated test suites for the integrated system once the changes have passed their unit tests. Automation provides the speed and consistency required to operate with such a high level of discipline and quality of work.

3 – Principle: Create Knowledge

This principle, sometimes written as “Amplify Learning”, addresses the point that the act of building something teaches everyone involved new ways of looking at both the original problem as well as what the best solution would be. Classic ‘omniscient specification’ at the beginning of a project carries the bizarre assumption that the specifier knows and understands all aspects of both problem and solution before writing the first word of the specification. This is obviously very unlikely at best. Lean and Agile address how the team quickly and continuously seeks out this learning, distributes the knowledge to all stakeholders, and takes action to adjust activities based on the new understanding. This behavior is one of the core maxims that really delivers the ‘agility’ in Agile.

Automation, as we have seen, provides speed and consistency that is not otherwise available. These capabilities serve to create knowledge by enabling the faster, easier collection of data. The data might be technical, in the form of the test results mentioned above as part of “Build Quality In”. A more advanced scenario might be more frequent value assessment – achieved by giving the business owners an easy facility for seeing a completed, or nearly completed, feature sooner – in order to validate the implementation before it is final. Even more advanced variants involve techniques such as “Canary deployments” or A/B testing – in which a limited audience of live customers receives early versions of features in order to analyze their response.

4 – Principle: Defer Commitment

The Defer Commitment principle addresses the point that teams would not want to take a design direction that they later learn was a fundamental ‘dead-end’. This principle is a response to the impact of knowledge creation (Principle #3 above). By delaying decisions that are hard to reverse, the team can dramatically reduce the risk of hitting a ‘dead end’ that might cause expensive rework or even kill the project.

Automation as applied to this principle also reflects the tight relationship to “Create Knowledge“. By exploiting the ability to collect more knowledge faster, and with a more complete context, teams can ensure they have the most thorough set of information possible before making a hard-to-reverse decision. Fast cycle time can also enable experimental scenarios that would not otherwise be possible. Promising architectural ideas can be prototyped in a more realistic running state because it is not too hard or time consuming to do so. This opens the team up to new and potentially better solutions, which would otherwise might have been too risky. Whether about a particular feature, architectural point, or design element, automation enables the team to ensure that it has real data from many deployment and test cycles before committing.

5 – Principle: Deliver Fast

The principle of delivering fast uses the fact that short cycle times mean less waste, more customer feedback, and more learning opportunities. A fast cycle will generally have less waste because there will be less wait time and less unfinished work at any given point in time. The ability to deliver quickly also means that more frequent customer feedback, which, as we have discussed, reduces waste and risk while increasing knowledge. Finally, delivering quickly will cause the team to focus on finishing a smaller number of features for each delivery rather than leaving many undone for a long period.

As has been described, speed is a common byproduct of automation. Getting working code into the hands of stakeholders is a key part of every approach to enabling business agility. In the case of applying automation’s speed to this principle, however, it is better to think in terms of frequency. Speed and frequency are closely related factors, of course – a long cycle time implies less frequent delivery and vice-versa. The point is simply that without automation, frequency will always be much lower. That means less feedback, less learning, and less knowledge for the team to use.

6 – Principle: Respect People

Beyond the obvious human aspects of this point, this principle is actually very pragmatic. The people closest to the work are the ones who know it best. They are best equipped to identify and solve challenges as they come up. Squashing their initiative to do so will diminish the team’s effectiveness and, through missed opportunities over time, the cost-effectiveness and value of the software itself.

In the previous principles, there are numerous statements about giving new capabilities to the team. This principle deals with how automation empowers the members of the team. Indeed, the alternate expression of this principle is “Empower the Team”. That phrasing gets to the crux of how automation does or does not respect the people. Automation itself cannot show respect, but how it is deployed most certainly can. For example, the contrast between a self-service facility that anyone on the team can use at any time and a similar facility for which individuals must ask permission each time they use it speaks volumes about the respect the organization has for the team’s professionalism. It will also drive behavior and self-discipline as the team matures. Consider how the practice in high-maturity Continuous Delivery scenarios has a direct, automatic path from check-in to production while so many shops still require multiple sign-offs. Which team is more likely to be effective, innovative, and efficient?

7 – Principle: Optimize the whole

This principle really focuses on how all of the principles are interrelated across the whole lifecycle. Other management theories, such as the “Theory of Constraints” address this with statements such as ‘you can never be faster than the slowest step’. This principle continues the theme of continuous learning and adjustment that pervades Lean thinking. It deals with the fact that in order to take time and waste out of a system, you need to understand its goal and then continuously and deliberately eliminate the root cause of the largest bottlenecks that prevent the most efficient realization of that goal.

The core of this principle is to optimize delivery of value to the customer – effectively starting with the value and working backward to the start of the process. Automation as a tool in that effort makes optimization substantially easier. When starting with a process that has manual steps, the very act of automating a process is an optimization by itself. Until the entire process flows from end to end with automation, the manual phases will be the more obvious bottlenecks. Then, once the automation spans the whole flow, the automation itself generates metrics for further improvement in cycle time and efficiency in pursuit of delivering value to the customer.

That optimization effort of the last principle takes the discussion somewhat circularly back to the first principle, which is to eliminate waste. That is quite appropriate. Given how interrelated all of these principles are, the discussion of the contributions of automation to them should be similarly interrelated.

This post is also on LinkedIn here: https://www.linkedin.com/pulse/lean-software-development-automation-dan-zentgraf

Ops Heroes are NOT Qualified to do Anything with Nothing

There is a certain “long-suffering and misunderstood” attitude that shows up a lot in Operations. I have seen this quote on a number of cube walls:

We the willing, 
led by the unknowing, 
are doing the impossible 
for the ungrateful. 

We have done so much, 
with so little, 
for so long, 
we are now qualified to do anything, 
with nothing. 

Note: This quote is often mistakenly attributed to Mother Teresa. It was actually from this other guy called Konstantin Josef Jireček that no one has heard of recently.

The problem, of course, is that this attitude is counter-productive in a DevOps world. It promotes the culture that operations will ‘get it done’ no matter what how much is thrown their way in terms of budget cuts, shortened timeframes, uptime expectations, etc. It is a great and validating thing in some ways – you pulled off the impossible and get praise heaped on you. It is really the root of defective ‘hero culture’ behaviors that show up in tech companies or tech departments. And no matter how many times we write about the defectiveness of hero culture in a sustained enterprise, the behavior persists due to a variety of larger societal attitudes.

If you have seen (or perpetuated) such a culture, do not feel too bad – aspects of it show up in other disciplines including medicine. There is a fascinating discussion of this – and the cultural resistance to changing the behaviors – in Atul Gawande’s book, The Checklist Manifesto. The book is one of my favorites of the last couple of years. It discusses the research Dr (yes – he is a surgeon himself) Gawande did on why the instance of complications after surgery was so high relative to other high-criticality activities. He chose aviation – which is a massively complex and yet very precise, life-critical industry. It also has a far better record of incident free activity relative to the more intimate and expertise-driven discipline of medicine. The book proceeds to look at the evolution of the cultures of both industries and how one developed a culture focused on the surgeon being omniscient and expert in all situations while the other created an institutional discipline that seeks to minimize human fallibility in tense situations.

He further looks into the incentives surgeons have – because they have a finite number of hours in the day – to crank through procedures as quickly as possible. That way they generate revenue and do not tie up scarce and expensive operating rooms. But surgeons really can only work so fast and procedures tend to take as long as they do for a given patient’s situation. Their profession is manual and primarily scales based on more people doing more work. Aviation exploits the fact that it deals with machines and has more potential for instrumentation and automation.

The analogy is not hard to make to IT Operations people having more and more things to administer in shorter downtime windows. IT Operations culture, unfortunately, has much more in common with medicine than it does with aviation. There are countless points in the book that you should think about the next time you are logged in with root or equivalent access and about to manually make a surgical change… What are you doing to avoid multitasking? What happens if you get distracted? What are you doing to leverage/create instrumentation – even something manual like a checklist – to ensure your success rate is better each time? What are you doing to ensure that what you are doing can be reproduced by the next person? It resonates…

The good news is that IT Operations as a discipline (despite its culture) deals with machines. That means it is MUCH easier to create tools and instrumentation that leverage expertise widely while at the same time improving the consistency with which tasks are performed. Even so, I have heard only a few folks mention it at DevOps events and that is unfortunate, because the basic discipline of just creating good checklists – and the book discusses how – is a powerful and immediately adoptable thing that any shop, regardless of platform, toolchain, or history can adopt and readily benefit from. It is less inspirational and visionary than The Phoenix Project,  but it is one of the most practical approaches of working toward that vision that exists.

The book is worth a read – no matter how DevOps-y your environment is or wants to be. I routinely recommend it to our junior team members as a way to help them learn to develop sustainable disciplines and habits. I have found this to be a powerful tool for managing overseas teams, too.

I would be interested in anyone’s feedback who is using checklist techniques – particularly as an enhancement / discipline roadmap in a DevOps shop. I have had some success wrapping automation and instrumentation (as well as figuring out how to prioritize where to add automation and instrumentation) by building checklists for things and would love to talk about it with others who are experimenting with it.

What Makes a Good DevOps Tool?

We had an interesting discussion the other day about what made a “good” DevOps tool.  The assertion is that a good citizen or good “link” in the toolchain has the same basic attributes regardless of the part of the system for which it is responsible. As it turns out, at least with current best practices, this is a reasonably true assertion.  We came up with three basic attributes that the tool had to fit or it would tend to fall out of the toolchain relatively quickly. We got academic and threw ‘popular’ out as a criteria – though supportability and skills availability has to be a factor at some point in the real world. Even so, most popular tools are at least reasonably good in our three categories.

Here is how we ended up breaking it down:

  1. The tool itself must be useful for the domain experts whose area it affects.  Whether it be sysadmins worried about configuring OS images automatically, DBAs, network guys, testers, developers or any of the other potential participants, if the tool does not work for them, they will not adopt it.  In practice, specialists will put up with a certain amount of friction if it helps other parts of the team, but once that line is crossed, they will do what they need to do.  Even among development teams, where automation is common for CI processes, I STILL see shops where they have a source control system that they use day-to-day and then promote from that into the source control system of record.  THe latter was only still in the toolchain due to a bureaucratic audit requirement.
  2. The artifacts the tool produces must be easily versioned.  Most often, this takes the form of some form of text-based file that can be easily managed using standard source control practices. That enables them to be quickly referenced and changes among versions tracked over time. Closed systems that have binary version tracking buried somewhere internally are flat-out harder to manage and too often have layers of difficulty associated with comparing versions and other common tasks. Not that it would have to be a text-based artifact per se, but we had a really hard time coming up with tools that produced easily versioned artifacts that did not just use good old text.
  3. The tool itself must be easy to automate externally.  Whether through a simple API or command line, the tool must be easily inserted into the toolchain or even moved around within the toolchain with a minimum of effort. This allows quickest time to value, of course, but it also means that the overall flow can be more easily optimized or applied in new environments with a minimum of fuss.

We got pretty meta, but these three aspects showed up for a wide variety of tools that we knew and loved. The best build tools, the best sysadmin tools, even stuff for databases had these aspects. Sure, this is proof positive that the idea of ‘infrastructure as code’ is still very valid. The above apply nicely to the most basic of modern IDEs producing source code. But the exercise became interesting when we looked at older versus newer tools – particularly the frameworks – and how they approached the problem. Interestingly we felt that some older, but popular, tools did not necessarily pass the test.  For example, Hudson/Jenkins are weak on #2 and #3 above.  Given their position in the toolchain, it was not clear if it mattered as much or if there was a better alternative, but it was an interesting perspective on what we all regarded as among the best in their space.

This is still an early thought, but I thought I would share the thought to see what discussion it would stimulate. How we look at tools and toolchains is evolving and maturing. A tool that is well loved by a particular discipline but is a poor toolchain citizen may not be the right answer for the overall organization. A close second that is a better overall fit might be a better answer. But, that goes against the common practice of letting the practitioners use what they feel best for their task. What do you do? Who owns that organizational strategic call? We are all going to have to figure that one out as we progress.

DevOps is about Building Fords, not Ferraris

There is an interesting obsession with having the ‘ultimate’ of whatever you’re talking about. This applies to most things in our society: jobs, houses, televisions, cars. You name it, there is an ‘ultimate’ version that everyone aspires to have. There is a lot of good to this behavior, to be sure. I believe strongly that everyone should be trying to get better all the time. Though I would point out that it is healthier to regard the ultimate [whatever] as a consequence or benefit of getting better rather than an end unto itself.

But it’s usually bad to want the ‘ultimate’ in your software delivery process. Goldplating has always been an enemy in software projects and there is evidence of it in how a lot of organizations have traditionally delivered software. It usually shows up in the culture, where high-intervention processes lead to hero cults and aspirations to be the ultimate ‘hero’ who gets releases out the door. Old-school, old-world hand craftsmanship is the order of the day. DevOps is the exact opposite of this approach. It focuses on a highly repeatable, scalable, and mass-produced approach to releasing software. And frequently.

Which brings me back to the contrast between a Ferrari and a Ford. A Ferrari is pretty much the ultimate sports car and ultimate sports car brand. There really is very little not to like. But the cars are exotics still built with expensive materials using manual, old world techniques. To be fair, Ferrari has a super-modern robotic process for a lot of their precision work, but they add a lot of customization and hand-finishing. And they ship a very few thousand releases (cars) each year. Sustaining such a car in the real world involves specially trained mechanics named Giuseppe, long waits for parts from Italy, and even shipping the car across the state if you don’t live close to a qualified shop. No biggie – if you can afford the car, you can afford the maintenance. But, let’s face it, they are a ‘money is no object’ accessory.

Ford has shipped a variety of performance models over the years based on the Mustang platform. In fact, there have been years where Ford has shipped more performance Mustangs in a week than Ferrari would ship cars in that YEAR. And there is a magic there for a DevOps geek. Plain ol’ Ford Motor Company has started selling a 200mph Mustang this year for about $60K. There’s nothing too exotic about it. You can go to your local Ford dealer and buy it. It can be purchased at one dealer and serviced at any other dealer anywhere in the country. Parts? No problem – most of them are in local warehouses stationed strategically so that no dealer would have to keep a customer waiting too long for common items. A lot of stuff can be had from your local AutoZone because, well, it’s “just” a Mustang.

The lesson, though, is that Ford has an economy of scale by virtue of the volume of Mustangs it produces. No, a Mustang is not as nice or as custom as a Ferrari. It is as common and mass-produced as anything. But a 200mph car that anyone can buy for noticeably less than a house, get parts easily, and have serviced at thousands of locations is an amazing and magical thing. It teaches a solid lesson about scalability and sustainability that should be inspirational for DevOps teams.

And maybe, just maybe, if your company does a good enough job at sustainably delivering your software, you might be able to afford that Ferrari someday…

PS – for Chevy zealots. I realize the Corvette cleared 200 on a “volume” platform first. But the 200mph Plastic Fantastic looks more exotic relative to the Mustang – which has a plain “sporty commuter” or even rental fleet version with a V6. And the common example of the economies of scale mean that the 200mph Shelby Mustang is still a bargain relative to the 200mph capable ‘vette, which is the point of this post.

DevOps DARPA Style

I think spending a lot of time on DevOps may skew my interpretation of different trends and articles.  To me it seems that everyone is trying to reinvent and “lean out” there design to engineering flow to be faster, more iterative, and generally more responsive to conditions in our rapidly changing world.  Faster is, of course, relative depending on what you are talking about.  I recently saw this article on the MIT Technology Review about DARPA (always a source of cool advanced engineering ideas) undertaking a rapid approach for getting a new tank designed and built.

Article here:  http://www.technologyreview.com/news/509311/darpa-wants-to-remake-manufacturing/

The article thematically addresses concepts like ensuring a common understanding of the design among contributing engineers and moving manufacturing knowledge closer to the design stage so it is actually a part of the design thinking.  My DevOps skew made the immediate association of how similar this was to the collaboration implicit with Agile and DevOps.  Everyone needs to know the architecture and Ops needs to be involved directly with development while development is underway to ensure rapid Continuous Delivery cycles.  It’s a good perspective on how applicable these concepts are on a much broader scale and in varied industries.

I figure that if these guys can do it with metal in the context of a tank, it has to be possible with whatever software or virtualization problem I”m dealing with.  Though it does make me want better toys for our office.  I have to believe that DARPA has cooler Nerf guns…