Automation and the Definition of Done

“Done” is one of the more powerful concepts in human endeavor. Knowing that something is “done“ enables us to move on to the next endeavor, allows us to claim compensation, and sends a signal to others that they can begin working with whatever we have produced. However, assessing done can be contentious – particularly where the criteria are undefined. This is why the ‘definition of done’ is a major topic in software delivery. Software development has a creative component that can lead to confusion and even conflict. It is not a trivial matter.

Automation in the software delivery process forces the team to create a clear set of completion criteria early in the effort, thus reducing uncertainty around when things are ‘done’ as well as what happens next. Though they at first appear to be opposites, with done defining a stopping point and automation being much more about motion, the link between ‘done’ and automation is synergistic. Being good at one makes the team better at the other and vice-versa. Being good at both accelerates and improves the team’s overall capability to deliver software.

A more obvious example of the power of “done” appears in the Agile community. For example, Agile teams often have a doctrine of ‘test driven development’ where developers should write the tests first. Further examples include the procedural concepts for completing the User Stories in each iteration, or sprint, so that the team can clearly assess completion in the retrospective. Independent of these examples, validation-centric scenarios are an obvious area where automation can help underpin “done”. In the ‘test-driven development’ example, test suites that run at various points provide unambiguous feedback over whether more work is required. Those test suites become part of the Continuous Integration (CI) job so that every time a developer commits new code. If those pass, then the build automatically deploys into the integration environment for further evaluation.

Looking a bit deeper at the simple process of automatically testing CI builds reveals how automation forces a more mature understanding of “done”. Framed another way, the fact that the team has decided to have that automated assessment means that they have implicitly agreed to a set of specific criteria for assessing their ‘done-ness’. That is a major step for any group and evidence of significant maturation of the overall team.

That step of maturation is crucial, as it enables better flows across the entire lifecycle. For example, understanding how to map ‘done-ness’ into automated assessment is what enables advanced delivery methodologies such as Continuous Delivery. Realistically, any self-service process, whether triggered deliberately by a button push or autonomously by an event, such as delivering code, cannot exist without a clear, easily communicated, understanding of when that process is complete and how successful it was. No one would trust the automation were it otherwise.

There is an intrinsic link between “Done” and automation. They are mutual enablers. Done is made clearer, easier and faster by automation. Automation, in turn, forces a clear definition of what it means to be complete, or ‘done’. The better the software delivery team is at one, the stronger that team is at the other.

 

This article is also on LinkedIn here: https://www.linkedin.com/pulse/automation-definition-done-dan-zentgraf

Advertisements

Ops Heroes are NOT Qualified to do Anything with Nothing

There is a certain “long-suffering and misunderstood” attitude that shows up a lot in Operations. I have seen this quote on a number of cube walls:

We the willing, 
led by the unknowing, 
are doing the impossible 
for the ungrateful. 

We have done so much, 
with so little, 
for so long, 
we are now qualified to do anything, 
with nothing. 

Note: This quote is often mistakenly attributed to Mother Teresa. It was actually from this other guy called Konstantin Josef Jireček that no one has heard of recently.

The problem, of course, is that this attitude is counter-productive in a DevOps world. It promotes the culture that operations will ‘get it done’ no matter what how much is thrown their way in terms of budget cuts, shortened timeframes, uptime expectations, etc. It is a great and validating thing in some ways – you pulled off the impossible and get praise heaped on you. It is really the root of defective ‘hero culture’ behaviors that show up in tech companies or tech departments. And no matter how many times we write about the defectiveness of hero culture in a sustained enterprise, the behavior persists due to a variety of larger societal attitudes.

If you have seen (or perpetuated) such a culture, do not feel too bad – aspects of it show up in other disciplines including medicine. There is a fascinating discussion of this – and the cultural resistance to changing the behaviors – in Atul Gawande’s book, The Checklist Manifesto. The book is one of my favorites of the last couple of years. It discusses the research Dr (yes – he is a surgeon himself) Gawande did on why the instance of complications after surgery was so high relative to other high-criticality activities. He chose aviation – which is a massively complex and yet very precise, life-critical industry. It also has a far better record of incident free activity relative to the more intimate and expertise-driven discipline of medicine. The book proceeds to look at the evolution of the cultures of both industries and how one developed a culture focused on the surgeon being omniscient and expert in all situations while the other created an institutional discipline that seeks to minimize human fallibility in tense situations.

He further looks into the incentives surgeons have – because they have a finite number of hours in the day – to crank through procedures as quickly as possible. That way they generate revenue and do not tie up scarce and expensive operating rooms. But surgeons really can only work so fast and procedures tend to take as long as they do for a given patient’s situation. Their profession is manual and primarily scales based on more people doing more work. Aviation exploits the fact that it deals with machines and has more potential for instrumentation and automation.

The analogy is not hard to make to IT Operations people having more and more things to administer in shorter downtime windows. IT Operations culture, unfortunately, has much more in common with medicine than it does with aviation. There are countless points in the book that you should think about the next time you are logged in with root or equivalent access and about to manually make a surgical change… What are you doing to avoid multitasking? What happens if you get distracted? What are you doing to leverage/create instrumentation – even something manual like a checklist – to ensure your success rate is better each time? What are you doing to ensure that what you are doing can be reproduced by the next person? It resonates…

The good news is that IT Operations as a discipline (despite its culture) deals with machines. That means it is MUCH easier to create tools and instrumentation that leverage expertise widely while at the same time improving the consistency with which tasks are performed. Even so, I have heard only a few folks mention it at DevOps events and that is unfortunate, because the basic discipline of just creating good checklists – and the book discusses how – is a powerful and immediately adoptable thing that any shop, regardless of platform, toolchain, or history can adopt and readily benefit from. It is less inspirational and visionary than The Phoenix Project,  but it is one of the most practical approaches of working toward that vision that exists.

The book is worth a read – no matter how DevOps-y your environment is or wants to be. I routinely recommend it to our junior team members as a way to help them learn to develop sustainable disciplines and habits. I have found this to be a powerful tool for managing overseas teams, too.

I would be interested in anyone’s feedback who is using checklist techniques – particularly as an enhancement / discipline roadmap in a DevOps shop. I have had some success wrapping automation and instrumentation (as well as figuring out how to prioritize where to add automation and instrumentation) by building checklists for things and would love to talk about it with others who are experimenting with it.

DevOps is about Building Fords, not Ferraris

There is an interesting obsession with having the ‘ultimate’ of whatever you’re talking about. This applies to most things in our society: jobs, houses, televisions, cars. You name it, there is an ‘ultimate’ version that everyone aspires to have. There is a lot of good to this behavior, to be sure. I believe strongly that everyone should be trying to get better all the time. Though I would point out that it is healthier to regard the ultimate [whatever] as a consequence or benefit of getting better rather than an end unto itself.

But it’s usually bad to want the ‘ultimate’ in your software delivery process. Goldplating has always been an enemy in software projects and there is evidence of it in how a lot of organizations have traditionally delivered software. It usually shows up in the culture, where high-intervention processes lead to hero cults and aspirations to be the ultimate ‘hero’ who gets releases out the door. Old-school, old-world hand craftsmanship is the order of the day. DevOps is the exact opposite of this approach. It focuses on a highly repeatable, scalable, and mass-produced approach to releasing software. And frequently.

Which brings me back to the contrast between a Ferrari and a Ford. A Ferrari is pretty much the ultimate sports car and ultimate sports car brand. There really is very little not to like. But the cars are exotics still built with expensive materials using manual, old world techniques. To be fair, Ferrari has a super-modern robotic process for a lot of their precision work, but they add a lot of customization and hand-finishing. And they ship a very few thousand releases (cars) each year. Sustaining such a car in the real world involves specially trained mechanics named Giuseppe, long waits for parts from Italy, and even shipping the car across the state if you don’t live close to a qualified shop. No biggie – if you can afford the car, you can afford the maintenance. But, let’s face it, they are a ‘money is no object’ accessory.

Ford has shipped a variety of performance models over the years based on the Mustang platform. In fact, there have been years where Ford has shipped more performance Mustangs in a week than Ferrari would ship cars in that YEAR. And there is a magic there for a DevOps geek. Plain ol’ Ford Motor Company has started selling a 200mph Mustang this year for about $60K. There’s nothing too exotic about it. You can go to your local Ford dealer and buy it. It can be purchased at one dealer and serviced at any other dealer anywhere in the country. Parts? No problem – most of them are in local warehouses stationed strategically so that no dealer would have to keep a customer waiting too long for common items. A lot of stuff can be had from your local AutoZone because, well, it’s “just” a Mustang.

The lesson, though, is that Ford has an economy of scale by virtue of the volume of Mustangs it produces. No, a Mustang is not as nice or as custom as a Ferrari. It is as common and mass-produced as anything. But a 200mph car that anyone can buy for noticeably less than a house, get parts easily, and have serviced at thousands of locations is an amazing and magical thing. It teaches a solid lesson about scalability and sustainability that should be inspirational for DevOps teams.

And maybe, just maybe, if your company does a good enough job at sustainably delivering your software, you might be able to afford that Ferrari someday…

PS – for Chevy zealots. I realize the Corvette cleared 200 on a “volume” platform first. But the 200mph Plastic Fantastic looks more exotic relative to the Mustang – which has a plain “sporty commuter” or even rental fleet version with a V6. And the common example of the economies of scale mean that the 200mph Shelby Mustang is still a bargain relative to the 200mph capable ‘vette, which is the point of this post.

DevOps DARPA Style

I think spending a lot of time on DevOps may skew my interpretation of different trends and articles.  To me it seems that everyone is trying to reinvent and “lean out” there design to engineering flow to be faster, more iterative, and generally more responsive to conditions in our rapidly changing world.  Faster is, of course, relative depending on what you are talking about.  I recently saw this article on the MIT Technology Review about DARPA (always a source of cool advanced engineering ideas) undertaking a rapid approach for getting a new tank designed and built.

Article here:  http://www.technologyreview.com/news/509311/darpa-wants-to-remake-manufacturing/

The article thematically addresses concepts like ensuring a common understanding of the design among contributing engineers and moving manufacturing knowledge closer to the design stage so it is actually a part of the design thinking.  My DevOps skew made the immediate association of how similar this was to the collaboration implicit with Agile and DevOps.  Everyone needs to know the architecture and Ops needs to be involved directly with development while development is underway to ensure rapid Continuous Delivery cycles.  It’s a good perspective on how applicable these concepts are on a much broader scale and in varied industries.

I figure that if these guys can do it with metal in the context of a tank, it has to be possible with whatever software or virtualization problem I”m dealing with.  Though it does make me want better toys for our office.  I have to believe that DARPA has cooler Nerf guns…

Slides from Agile Austin Talk 2-12-2013

This post slides from my talk at Agile Austin on February 12, 2013.

I want to thank the group for the opportunity and the audience for the great interaction!

Link to slides:  DevOps Beyond the Basics – FINAL

A System for Changing Systems – Part 9

The last capability area in the framework is that of Monitoring. I saved this for last because it is the one that tends to be the most difficult to get right. Of course, commensurate with the difficulty is the benefit gained when it is working properly. A lot of the difficulty and benefit with Monitoring comes from the fact that knowing what to look at, when to look at it and what NOT to look at are only the first steps. It also becomes important to know what distributed tidbits of information to bring together if you actually want a complete picture of your application environment.

Monitoring

Monitoring Capability Area

This post could go for pages – and Monitoring is likely going to be a consuming topic as this series progresses, but for the sake of introduction, lets look at the Monitoring capability area. The sub-capabilities for this area encompass the traditional basics of monitoring Events and Trends among them. The challenge for these two is in figuring out which Events to monitor and sometimes how to get the Event data in the first place. The Trends must then be put into a Report format that resonates with management. It is important to invest in this area in order to build trust with management that the team has control as it tries to increase the frequency of changes – without management’s buy-in, they won’t fund the effort. Finally, the Correlation sub-capability area is related to learning about the application system’s behavior and how changes to some part of the system impacts the other parts. This is an observational knowledge base that must be deliberately built by the team over time so that they can put the Events, Trends, and Reports into the most useful contexts and use the information to better understand risks and priorities when making changes to the system.

A System for Changing Systems – Part 8

The fourth capability area is that of Provisioning. It covers the group of activities for creating all or part of an environment in which an application system can run. This is a key capability for ensuring that application systems have the capacity they need to maintain performance and availability. It is also crucial for ensuring that development and test activities have the capacity they need to maintain THEIR performance. The variance with test teams is that a strong Provisioning capability also ensures that development and test teams can have clean dev/test environments that are very representative of prorduction environments and can very quickly refresh those dev/test environments as needed. The sub-capabilities here deal with managing the consistency of envionment configurations, and then quickly building environments to a known state.

Provisioning Capability Area

Provisioning Capability Area

The fifth capability area is closely related to Provisioning. It is the notion of a System Registry capability. This set of capabilities deals with delivering the assumed infrastructure functions (e.g. DNS, e-mail relays, IP ranges, LDAP, etc.) that surround the environments. These capabilities must be managed in such a way that one or more changes to an application system can be added to a new or existing environment with out significant effort or disruption. In many ways this capability area is the fabric in which the others operate. It can also be tricky to get right because this capability area often spans multiple application systems.

System Registry Capability Area

System Registry Capability Area