Management, Leadership, Continuous Improvement, and DevOps

There is a management aphorism that “If you can’t measure it, you can’t manage it.” That gets attributed to Peter Drucker, though it is not actually what he said. What he said was “If you can’t measure it, you can’t IMPROVE it”. That’s an important difference as we talk about bringing DevOps and its related practices and disciplines to enterprises.

If you think about it, measuring with an intent to improve something is a much more challenging proposition. Management of something is usually about keeping it within known parameters – maintaining a certain status quo. That is not to imply that Management is not valuable – it is absolutely crucial for maintaining a level of rigor in what is going on. But Improvement deliberately pressures the status quo in order to redefine it at a new point. In fact, redefining the status quo to a better state sounds an awful lot like what we talk about in the DevOps movement.

Improvement always sounds very cool, but there is also an icky truth about Improvement – it is a relative thing. There are no easy answers for questions like:

  • ‘What point are we improving to?’
  • ‘How do we know when we have improved enough for a while in one area?’
  • ‘What is the acceptable rate of progress toward the improved state?’
  • and so on…

Those must be answered carefully and the answers must be related to each other. Answering those questions requires something different from Management. It requires Leadership to provide a vision. That brings us to another famous Drucker quote: “Management is doing things right; leadership is doing the right things.”

That quote is a sharp observation, but it does not really judge one as ‘better’ than the other. Leadership is exciting and tends to be much more inspirational at the human level. It therefore usually gets more attention in transitional efforts. However, without the balance of Management, Leadership may not be a sustainable thing in a business situation.

In terms of DevOps and the discipline of Continuous Improvement, the balance of these two things can be articulated with relative clarity. Leadership provides the answers for the hard questions. Management provides the rigor and discipline to maintain steady progress toward the new status quo defined by those answers. Put more simply, Leadership sets forth the goals and Management makes sure we get to those goals.

There is a certain bias in DevOps toward valuing Leadership – the desire to set and pursue improvement of our daily tech grind. Maybe that is because DevOps is an emergent area that requires a certain fortitude and a focus on doing the right things to get it started. And Leadership is certainly good for that. However, I also work with organizations where the well-intended, but unfocused, efforts of leadership-minded people lead to chaos. Those DevOps ‘transformations’ tend to flounder and even make things worse for the people involved. Which is not very DevOps at all.

I have seen enough of these that I have been spending time lately trying to organize my thoughts on the balance point. In the meantime, a piece of advice when you want to pursue a great idea / innovation – figure out how you want to answer the hard questions so you can make them stick in your organization and truly reap the benefit of that idea. Then, you can get on to the next one, and the next one, and the next one – to achieve the steady improvement of your status quo that is near the heart of DevOps culture.

This article is also on LinkedIn here: https://www.linkedin.com/pulse/management-leadership-continuous-improvement-devops-dan-zentgraf


Predictability is Predictably Hard

In order to successfully automate something, the pieces being automated have to be ‘predictable’. I use ‘predictable’ here – rather than ‘consistent’ – deliberately. A ‘predictable’ environment means you can anticipate its state and configuration. ‘Consistent’ gets misconstrued as ‘unchanging’, which is the opposite of what Agile software delivery is trying to achieve.

Consider deploying a fresh build of an application into a test environment. If you cannot predict what the build being deployed looks like or how the environment will be set up, why would you expect to get that build working reliably in that environment within a predictable window of time? And yet, that is exactly what so many teams expect.

The proposed solution is usually to automate the deployment. That, however, leads to its own problems if you do not address the predictability of the underlying stuff being automated. I talk to teams with stories about how they abandoned automation because it ‘slowed things down’ or ‘just did not work’. That leads teams to say, and in some cases believe, that their applications are ‘too complex to deploy automatically’.

At the heart of the struggle to achieve predictability of the code packages and environments is the fact that they are owned by different teams. Somehow it is harder to collaborate with the development or operations team than it is to spend months attempting to build a mountain of hard-to-maintain deployment code. A mountain of code that stands a good chance of being abandoned, by the way. That represents months of wasted time, effort, and life because people working on the same application do not collaborate or cooperate.

And we get another example of why so many DevOps conversations become about culture rather than technology… Which really sucks, because that example comes at the cost of a fair bit of pain for the real people on those teams.

The lesson here is that there is no skipping the hard work of establishing predictability in the packaging of the code and environments before charging into automating deployments. We are in an era now where really good packaging and configuration management tools are very mature. And the next generation of tools that unifies the code and environment changes into immutable, deployable, and promotable artifacts is coming fast. But even with all of these awesome tools, cross-disciplinary experts will have to come together to contribute to the creation of predictable versions of those artifacts.
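To make ‘predictable’ concrete, here is a minimal sketch – not any particular tool’s API – of fingerprinting a code package together with its target environment configuration so that both teams can verify exactly which combination is being promoted. The paths and configuration fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def fingerprint_artifact(package_path: str, env_config: dict) -> dict:
    """Build a manifest identifying a package + environment combination.

    A deployment is 'predictable' when both teams can regenerate and compare
    this manifest before anything is pushed to the target environment.
    """
    pkg_hash = hashlib.sha256(Path(package_path).read_bytes()).hexdigest()
    # Canonical JSON so the same configuration always hashes the same way.
    cfg_hash = hashlib.sha256(
        json.dumps(env_config, sort_keys=True).encode()
    ).hexdigest()
    return {
        "package": package_path,
        "package_sha256": pkg_hash,
        "environment_sha256": cfg_hash,
        # A single identifier for the whole combination being promoted.
        "combined_id": hashlib.sha256((pkg_hash + cfg_hash).encode()).hexdigest()[:12],
    }

if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    Path("build").mkdir(exist_ok=True)
    Path("build/myapp-1.4.2.tar.gz").write_bytes(b"pretend this is the real package")
    manifest = fingerprint_artifact(
        "build/myapp-1.4.2.tar.gz",
        {"os": "ubuntu-22.04", "jdk": "17", "db_schema": "v42"},
    )
    print(json.dumps(manifest, indent=2))
```

The hashing itself is not the point – the point is that a single identifier for “this package plus this environment” gives every team the same reference for what is being deployed.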

The ‘C’ in CAMS stands for “Culture”, and collaboration is the heart of that culture. There are no shortcuts.

This article is also on LinkedIn here: https://www.linkedin.com/pulse/predictability-predictably-hard-dan-zentgraf/

Old Habits Make DevOps Transformation Hard

My father is a computer guy. Mainframes and all of the technologies that were cool a few decades ago. I have early memories of playing with fascinating electro-mechanical stuff at Dad’s office and its datacenter. Printers, plotters, and their last remaining card punch machine in a back corner. Crazy cool stuff for a kid if you have ever seen that gear in action. There’s all kinds of noise and things zipping around.

Now the interesting thing about talking to Dad is that he is seriously geeky about tech. He has always been fascinated by how tech will be applied, and he completely groks the principles and potential of new technology even if he does not really get the specific implementations. Recently he had a problem printing from his iPhone. He had set it up a long time ago and it worked great. He’s 78 and didn’t bat an eye at connecting his newfangled mobile device to his printer. What was interesting was his behavior when the connection stopped working. He tried mightily to fix the connection definition rather than deleting the configuration and simply recreating it with the wizard. That got me thinking about “fix it” behavior and troubleshooting habits in IT.

My dad, as an old IT guy, had long experience and training that said you fix things when they get out of whack. You certainly didn’t delete a printer definition back in the day – you would edit the file, test it, and fiddle with it until you got the thing working again. After all, there were only so many pieces of equipment in the datacenter and offices. That approach makes no sense in a situation where you can simply blow the problematic thing away and let the software automatically recreate it.

And that made me think about DevOps transformations in the enterprise.

I run into so many IT shops where people far younger than my dad struggle mightily to troubleshoot and fix things that could (or should) be easily recreated. To be fair – some troubleshooting is valuable and educational, but a lot is over routine stuff that is either well known, industry standard, or just plain basic. Why isn’t that stuff in an automated configuration management system? Or a VM snapshot? Or a container? Heck – why isn’t it in the Wiki, at least?! And the funny thing is that these shops are using virtualization and cloud technologies already, but treat the virtual artifacts the same way as they did the long-lasting, physical equipment-centric setups of generations past. And that is why so many DevOps conversations come back to culture. Or perhaps ‘habit’ is a better term in this case.
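The ‘recreate it rather than repair it’ habit can be boiled down to something as small as the sketch below: compare what is running to a declared, version-controlled spec and, on any drift, rebuild from the spec instead of patching in place. Everything here – the spec, the state check, the rebuild – is a hypothetical stand-in for whatever configuration management or container tooling a shop actually uses.

```python
# Declared, version-controlled description of what the thing should look like.
DESIRED = {"image": "myapp:1.4.2", "port": 8080, "env": {"LOG_LEVEL": "info"}}

def current_state() -> dict:
    """Stand-in for querying the real runtime (container engine, VM, etc.)."""
    return {"image": "myapp:1.4.2", "port": 8080, "env": {"LOG_LEVEL": "debug"}}

def recreate(spec: dict) -> None:
    """Stand-in for destroying the drifted instance and rebuilding from spec."""
    print(f"Recreating from spec: {spec}")

if __name__ == "__main__":
    if current_state() != DESIRED:
        # Do not troubleshoot the drifted instance in place - replace it.
        recreate(DESIRED)
    else:
        print("Running instance matches the declared spec.")
```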

Breaking habits is hard, but we must if we are to move forward. When the old ways do not work for a retired IT guy, you really have to think about why anyone still believes they work in a current technology environment.

This article is on LinkedIn here: https://www.linkedin.com/pulse/old-habits-make-devops-transformation-hard-dan-zentgraf

Your Deployment Doc Might Not be Useful for DevOps

One of the most common mistakes I see people making with automation is the assumption that they can simply wrap scripts around what they are doing today and be ‘automated’. The assumption is based on some phenomenally detailed runbook or ‘deployment document’ that has every command that must be executed. In ‘perfect’ sequence. And usually in a nice bold font. It was what they used for their last quarterly release – you know, the one two months ago? It is also serving as the template for their next quarterly release…

It’s not that these documents are bad or not useful. They are actually great guideposts and starting points for deriving a good automated solution for releasing software in a given environment. However, you have to remember that these are the same documents that are used to guide late-night, all-hands, ‘war room’ deployments. The idea that the documented procedures are repeatably automate-able is suspect, at best, based on that observation alone.

Deployment documents break down as an automate-able template for a number of reasons. First, there are almost always undocumented assumptions about the state of the environment before a release starts. Second, reusing the last document does not account for procedural, parameter, or other changes between the prior and the upcoming releases. Third, the processes usually rely, often unconsciously, on interpretation or tribal knowledge on the part of the person executing the steps. Finally, steps that make sense in a sequential, manual process do not take advantage of the intrinsic benefits of automation, such as parallel execution, elimination of data entry tasks, and so on.

The solution is to set the expectation up front – particularly with those who hold organizational power – that the document is only a starting point. Build the automation iteratively and schedule multiple iterations at the start of the effort. This can be a great way to introduce Agile practices into the traditionally waterfall approaches used in operations-centric environments. It also allows for the effort that will be required to fill in gaps in the document’s approach, negotiate standard packaging and tracking of deployable artifacts, add environment ‘config drift’ checks, and handle the other common ‘pitfall’ items that require more structure in an automated context.
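One way to start that iteration is to turn the runbook’s unstated assumptions into explicit, executable pre-flight checks that run before any deployment step does. The sketch below is only illustrative – the hostnames, ports, and thresholds are invented, and real checks would come from your own environment.

```python
import shutil
import socket

def _port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Each entry makes an assumption from the runbook explicit and testable.
# The hostnames, ports, and thresholds here are placeholders.
PREFLIGHT_CHECKS = {
    "deployment tooling installed": lambda: shutil.which("rsync") is not None,
    "database port reachable": lambda: _port_open("db.internal.example", 5432),
    "enough free disk space": lambda: shutil.disk_usage("/").free > 2 * 1024**3,
}

def run_preflight() -> bool:
    ok = True
    for name, check in PREFLIGHT_CHECKS.items():
        try:
            passed = check()
        except Exception:
            passed = False
        print(f"[{'PASS' if passed else 'FAIL'}] {name}")
        ok = ok and passed
    return ok

if __name__ == "__main__":
    # Fail fast instead of discovering a bad assumption halfway through a deploy.
    raise SystemExit(0 if run_preflight() else 1)
```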

This article is also on LinkedIn here: https://www.linkedin.com/pulse/your-deployment-doc-might-useful-devops-dan-zentgraf

Ops Heroes are NOT Qualified to do Anything with Nothing

There is a certain “long-suffering and misunderstood” attitude that shows up a lot in Operations. I have seen this quote on a number of cube walls:

We the willing, 
led by the unknowing, 
are doing the impossible 
for the ungrateful. 

We have done so much, 
with so little, 
for so long, 
we are now qualified to do anything, 
with nothing. 

Note: This quote is often mistakenly attributed to Mother Teresa. It actually comes from a guy named Konstantin Josef Jireček whom hardly anyone has heard of recently.

The problem, of course, is that this attitude is counter-productive in a DevOps world. It promotes a culture in which operations will ‘get it done’ no matter how much is thrown their way in terms of budget cuts, shortened timeframes, uptime expectations, and so on. It is a great and validating thing in some ways – you pulled off the impossible and get praise heaped on you. It is also the root of the defective ‘hero culture’ behaviors that show up in tech companies and tech departments. And no matter how many times we write about the defectiveness of hero culture in a sustained enterprise, the behavior persists due to a variety of larger societal attitudes.

If you have seen (or perpetuated) such a culture, do not feel too bad – aspects of it show up in other disciplines, including medicine. There is a fascinating discussion of this – and the cultural resistance to changing the behaviors – in Atul Gawande’s book, The Checklist Manifesto. The book is one of my favorites of the last couple of years. It discusses the research Dr. Gawande (yes, he is a surgeon himself) did on why the incidence of complications after surgery was so high relative to other high-criticality activities. For comparison he chose aviation – a massively complex and yet very precise, life-critical industry. It also has a far better record of incident-free activity relative to the more intimate and expertise-driven discipline of medicine. The book proceeds to look at the evolution of the cultures of both industries: how one developed a culture focused on the surgeon being omniscient and expert in all situations, while the other created an institutional discipline that seeks to minimize human fallibility in tense situations.

He further looks into the incentives surgeons have – because they have a finite number of hours in the day – to crank through procedures as quickly as possible. That way they generate revenue and do not tie up scarce and expensive operating rooms. But surgeons really can only work so fast and procedures tend to take as long as they do for a given patient’s situation. Their profession is manual and primarily scales based on more people doing more work. Aviation exploits the fact that it deals with machines and has more potential for instrumentation and automation.

The analogy to IT Operations – people with more and more things to administer in ever-shorter downtime windows – is not hard to make. IT Operations culture, unfortunately, has much more in common with medicine than it does with aviation. There are countless points in the book worth thinking about the next time you are logged in with root or equivalent access and about to manually make a surgical change… What are you doing to avoid multitasking? What happens if you get distracted? What are you doing to leverage or create instrumentation – even something manual like a checklist – to ensure your success rate is better each time? What are you doing to ensure that what you are doing can be reproduced by the next person? It resonates…

The good news is that IT Operations as a discipline (despite its culture) deals with machines. That means it is MUCH easier to create tools and instrumentation that spread expertise widely while at the same time improving the consistency with which tasks are performed. Even so, I have heard only a few folks mention the book at DevOps events, and that is unfortunate, because the basic discipline of simply creating good checklists – and the book discusses how – is powerful and immediately adoptable by any shop, regardless of platform, toolchain, or history. It is less inspirational and visionary than The Phoenix Project, but it is one of the most practical approaches that exists for working toward that vision.
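As a small illustration of how adoptable the idea is, here is a sketch of a checklist that is data rather than prose – it can be versioned, reviewed, and backed by automation one item at a time. The items shown are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ChecklistItem:
    description: str
    # If automated() is provided the item verifies itself; otherwise a human
    # confirms it. This lets a team automate a checklist one item at a time.
    automated: Optional[Callable[[], bool]] = None

RELEASE_CHECKLIST = [
    ChecklistItem("Change ticket approved and linked"),
    ChecklistItem("Backup of current configuration captured"),
    ChecklistItem("Monitoring dashboard open for the affected service"),
    ChecklistItem("Disk space on target above threshold",
                  automated=lambda: True),  # placeholder for a real probe
]

def run_checklist(items) -> bool:
    all_ok = True
    for item in items:
        if item.automated is not None:
            ok = item.automated()
        else:
            answer = input(f"{item.description} -- done? [y/N] ")
            ok = answer.strip().lower() == "y"
        print(f"[{'OK' if ok else 'STOP'}] {item.description}")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    if not run_checklist(RELEASE_CHECKLIST):
        raise SystemExit("Checklist incomplete - do not proceed.")
```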

The book is worth a read – no matter how DevOps-y your environment is or wants to be. I routinely recommend it to our junior team members as a way to help them learn to develop sustainable disciplines and habits. I have found this to be a powerful tool for managing overseas teams, too.

I would be interested in feedback from anyone who is using checklist techniques – particularly as an enhancement / discipline roadmap in a DevOps shop. I have had some success introducing automation and instrumentation (and prioritizing where to add them) by first building checklists, and would love to compare notes with others who are experimenting with the approach.

A System for Changing Systems – Part 3 – How Many “Chang-ee”s

As mentioned in the last post, once there is a “whole system” understanding of an application system, the next problem is that there are really multiple variants of that system running within the organization at any given time. There are notionally at least three: Development, Test, and Production. In reality, however, most shops frequently have multiple levels of test and potentially more than one Development variant. Some even have Staging or “Pre-production” areas very late in test where the modified system must run for some period before finally replacing the production environment. A lot of this environment proliferation is based on historic processes that are themselves a product of the available tooling and lessons organizations have learned over years of delivering software.

Example Environment Flow

This is a simplified, real-world example flow through some typical environments. Note the potential variable paths – another reason to know what configuration is being tested.

Tooling and processes are constantly evolving. The DevOps movement is really a reflection of the mainstreaming of Agile approaches and cloud-related technologies, and it is ultimately a discussion of how best to exploit them. That discussion, as it applies to environment proliferation, means we need to understand the core problems we are trying to solve. The two main problem areas are maintaining the validity of the sub-production environments as representative of production, and tracking the groupings of changes to the system in each of the environments.

The first problem area, maintaining the validity of sub-production environments, is more complex than it would seem. There are organizational silo problems where different groups own the different environments. For example, a QA group may own the lab configurations and therefore have a disconnect relative to the production team. There are also multipliers associated with technical specialties, such as DBAs or Network Administration, which may be shared across some levels of environment. And as if the complexity of the organization were not enough, there are other issues associated with teams that do not get along well, the business’ perception that test environments are less critical than production, and other organizational dynamics that make it that much more difficult to ensure good testing regimes are part of the process.

The second key problem area that must be addressed is tracking the groups of changes to the application system that are being evaluated in a particular sub-production environment. This means having a unique identifier for the combination of application code, database schema and dataset, system configuration, and network configuration. That translates to five version markers – one for each of the main areas of the application system, plus one for the particular combination of all four. On the surface this is straightforward, but in most shops there are few facilities for tracking versions of configurations outside of software code. Even where such facilities exist, they are too often not connected to one another in a way that tracks groupings of configurations.
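To picture those five markers, here is a small, hypothetical sketch that derives the fifth marker – an identifier for the combination – from the four component versions. The version strings and environment are invented for illustration; a real implementation would pull them from whatever tracks each component.

```python
import hashlib

def combination_id(app_code: str, db_schema: str,
                   system_config: str, network_config: str) -> str:
    """Derive the fifth marker: one identifier for a specific combination
    of the four component versions running in an environment."""
    combined = "|".join([app_code, db_schema, system_config, network_config])
    return hashlib.sha256(combined.encode()).hexdigest()[:12]

# Hypothetical example: what is currently deployed in a QA environment.
qa_components = {
    "app_code": "release-2.7.1+build.483",
    "db_schema": "schema-v19",
    "system_config": "sysconf-2014.03.2",
    "network_config": "netconf-7",
}
print("QA combination id:", combination_id(**qa_components))
```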

The typical pattern for solving these two problems actually begins with the second problem. It is difficult to ensure the validity of a test environment if there is no easy way to identify and understand the configuration of the components involved. This is why many DevOps initiatives start with configuration management tools such as Puppet, Chef, or VMware vCenter. It is also why “all-in-one” solutions such as IBM’s Pure family are starting to enter the market. Once an organization gets a handle on its configurations, it is substantially easier to have fact-based engineering conversations about valid test configurations and environments, because everyone involved has a clear reference for understanding exactly what is being discussed.

This problem discussion glosses over the important aspect of maintaining these tools and environments over time. Consistently applying the groups of changes to the various environments requires a complex system in its own right. The term system is the most appropriate one because the needed capabilities go well beyond the scope of a single tool, and those capabilities need to be available for each of the system components. Any discussion of such broad capabilities is well beyond the scope of a single blog post, so the next several posts in this series will look at a framework for understanding the capabilities needed for such a system.

A System for Changing Systems – Part 1 – Approach

This is the first post in a series that will look at common patterns among DevOps environments.  Based on those patterns, the series will attempt to put together a reasonable structure that helps organizations focus DevOps discussions, prioritize choices, and generally improve how they operate.

In the last post, I discussed how many shops take the perspective of developing a system for DevOps within their environments.  This notion of a “system for changing systems” as a practical way of approaching DevOps requires two pieces.  The first is the system being changed – the “change-ee” system.  The second is the system doing the changing – the “DevOps”, or “change-er”, system.  Before talking about automatically changing something, it is necessary to have a consistent understanding of the thing being changed.  Put another way, no automation can operate well without a deep understanding of the thing being automated.  So this first post is about establishing a common language for generically understanding the application systems – the “change-ee” systems in this discussion.

A note on products, technologies, and tools…  Given the variances in architectures for application (“change-ee”) systems, and therefore the implied variances in the systems that apply changes to them, it is not useful to be prescriptive about products for either.  In fact, a key goal of this framework is to ensure that it is as broadly applicable and useful as possible for solving DevOps-related problems in any environment.  That would be very difficult if it focused too heavily on any one technology stack.  So these posts will not necessarily name names other than to use them as examples of categories of tools and technologies.

With these things in mind, these posts will progress from the inside-out.  The next post will begin the process with a look at the typical components in an application system (“change-ee”).  From there, the next set of posts will discuss the capabilities needed to systematically apply changes to these systems.  Finally, after the structure is completed, the last set of posts will look at the typical progression of how organizations build these capabilities.

The next post will dive in and start looking at the structure of the “change-ee” environment.

How Fast Should You Change the Tires?

I am an unabashed car nut and like to watch a variety of motor racing series. In particular, I tend to stay focused on Formula 1 with a secondary interest in the endurance series (e.g., Le Mans). Watching several races recently, I noticed that the differences in how each series manages tire changes during pit stops carry some interesting analogies to deploying software quickly.

Each racing series has a different set of rules and limitations with regard to how pit stops may be conducted. These rules are imposed for a combination of safety reasons, competitive factors, and the overall viability of the racing series. There are even rules about changing tires. Some series enable very quick tire changes – others less so. The reasons behind these differences and how they are applied by race teams in tight, time competitive situations can teach us lessons about the haste we should or should NOT have when deploying software.

Why tire changes? The main reason is that, as with deploying software, there are multiple potential points of change (four tires on the car; software, data, systems, and network in a software release). And in both situations, it matters less how fast you can change just one of them than how fast you can change all of them. There are even variants where you may not need to change all four tires (or system components) every time, but you must still be precise in your changes.

Formula 1

Formula 1 is a fantastically expensive racing series and features extreme everything – including the fastest pit stops in the business. Sub-4-second stops are the norm, during which all 4 tires are changed. There are usually around 18 people working on the car – 12 of whom are involved in getting the old tires off and clear while putting new tires on (not counting another 2 to work the jacks). That is a large team, with a lot of expensive people on it, who invest a LOT of expensive time practicing to ensure that they can get all 4 tires changed in a ridiculously short period of time. And they have to do it for two cars with potentially different tire strategies, do it safely, and do it while competing in a sport that measures advantage in thousandths of a second.

But, there is a reason for this extreme focus / investment in tire changes. The tire changes are the most complex piece of work being done on the car during a standard pit stop. Unlike other racing series, there is no refueling in Formula 1 – the cars must have the range to go the full race distance. In fact, the races are distance and time limited, so the components on the cars are simply engineered to go that distance without requiring service, and therefore time, during the race. There are not even windows to wash – it is an open cockpit car. So, the tires are THE critical labor happening during the pit stop and the teams invest accordingly.

Endurance (Le Mans)

In contrast to the hectic pace of a Formula 1 tire change is Endurance racing. These cars are built to take the abuse of racing for 24 hours straight. They require a lot of service over the course of that sort of race, and the tires are therefore only one of several critical points that have to be serviced. Endurance racers have to be fueled, have brake components replaced, and swap among their three drivers periodically so each can rest. The rules of this series, in fact, limit the number of tire wrenches the team can use in the pits to just one. That is done to discourage teams from cutting corners and also to keep team size (and therefore costs) down.

NASCAR

NASCAR is somewhere between Formula 1 and Endurance racing when it comes to tire changes. This series limits tire wrenches to two and tightly regulates the number of people working on the car during a pit stop. These cars require fuel, clean-up, and tires just like the Endurance cars, but generally do not require any additional maintenance during a race, barring damage. So, while changing tires quickly is important, there are other time eating activities going on as well.

Interestingly, in addition to safety considerations, NASCAR limits personnel to keep costs down and help the teams competing in the series afford to do so. That keeps the overall competition healthy by ensuring a good number of participants and the ability of new teams to enter. That, by contrast, is one of the problems Formula 1 has had over the years.

In comparing the three approaches to the same activity, you see an emerging pattern where the ultimate speed of changing tires gets traded off against cost and contextual criticality. These are the same trade-offs a business makes when it looks at how much faster it can perform a regular process such as deploying software. You could decide you want sub-four-second tire changes, but that would be dumb if your business needs 10 seconds for refueling or several minutes for driver swaps and brake overhauls. And if it does, your four-second tire change would look wasteful at best as your army of tire guys stands around watching the guy fueling the car or the new driver adjusting his safety harness.

The message here is simple – understand what your business needs when it comes to deployment. Take the thrill of speed out of it and make an unemotional decision to optimize, knowing that optimal means contextually fastest without waste. Organizations that literally make their living from speed understand this. You should consider it the next time you go looking to do something faster.

Rollback Addiction

A lot of teams are fascinated with the notion of ‘rollback’ in their environments.  They seem to be addicted to its seductive conceptual simplicity.  It does, of course, have valid uses, but, like a lot of things, it can become a self-destructive dependency if it is abused.  So, let’s take a look at some of the addictive properties of rollback and how to tell if you might have a problem.

Addictive property #1 – everybody is doing it.  One of the first things we learn in technology (Dev or Ops) is to keep a backup of a file when we change something.  That way, we know what we did in case it causes a problem.  In other words, our first one is free and was given to us by our earliest technology mentors.  As a result, everyone knows what it is.  It is familiar and socially acceptable.  We learn it while we are so professionally young that it becomes a “comfort food”.  The problem is that it is so pervasive that people do it without noticing.  They will revert things without giving it a second thought because it is a “good thing”.  And this behavior is a common reason rollback does not scale.  In a large system, where many people might be making changes, others might make changes based on your changes.  So undoing yours without understanding others’ dependencies means that you are breaking other things in an attempt to fix one thing.  If you are in a large environment and changes “backward” are not handled with the exact same rigor as changes forward, you might have a problem.
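As one illustration of giving backward changes the same rigor as forward ones, here is a hedged sketch that scans a change history for later changes touching the same components as the one being reverted. The change records and component names are invented; the point is the dependency check, not the data format.

```python
from dataclasses import dataclass

@dataclass
class Change:
    change_id: str
    components: frozenset  # parts of the system the change touched

# Hypothetical change history, oldest first.
HISTORY = [
    Change("CHG-101", frozenset({"app", "db_schema"})),
    Change("CHG-102", frozenset({"network"})),
    Change("CHG-103", frozenset({"db_schema"})),  # builds on CHG-101's schema change
]

def dependents_of(target_id: str, history) -> list:
    """Return later changes that touch the same components as the target.

    Reverting the target without reviewing these is how a 'simple' rollback
    ends up breaking something else.
    """
    idx = next(i for i, c in enumerate(history) if c.change_id == target_id)
    target = history[idx]
    return [c for c in history[idx + 1:] if c.components & target.components]

if __name__ == "__main__":
    at_risk = dependents_of("CHG-101", HISTORY)
    if at_risk:
        print("Review before rolling back CHG-101:",
              [c.change_id for c in at_risk])
    else:
        print("No later changes overlap with CHG-101.")
```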

Addictive property #2 – it makes you feel good and safe.  The idea that “you are only as good as your last backup” is pretty pervasive.  So, the ability to roll something back to a ‘known good’ state gives you that warm, fuzzy feeling that it’s all OK.  Unfortunately, in large scale situations with any significant architectural complexity, it is probably not OK.  Some dependency is almost certainly unknown, overlooked or assumed to be handled manually.  That will lead to all sorts of “not OK” when you try to roll back.  If rollback is the default contingency plan for every change you make and you don’t systematically look at other options to make sure it is the right answer, you might have a problem.

Addictive property #3 – it is easy to sell.  Management does not understand the complexity of what is required to implement a set of changes, but they do understand “Undo”.  As a result, it is trivial to convince them that everything is handled and, if there should be a problem with a change, you can just ‘back it out’.  Simplifying the risks to an ‘undo’ type of concept can eliminate an important checkpoint from the process.  Management falls into the all-too-human behavior of assuming there is an ‘undo’ for everything and stops questioning the risk management plan because they think it is structurally covered.  This leads to all sorts of ugliness when there is a problem and the expectation of an easy back-out is not met.  Does your team deliberately check its contingency plan for oversights every time, or does it assume that it will just ‘roll it back’?  If the latter, you might have a problem.

As usual, the fix for a lot of this is self-discipline.  The discipline to do things the hard and thorough way that takes just that little bit longer.  The discipline to institutionalize and reward healthy behaviors in your shop.  And, as usual, that goes against a fair bit of human nature that can be very difficult to overcome, indeed.

How a DOMO Fits

In my previous post, I discussed the notion of a DevOps Management Organization, or DOMO.  As I said there, this is an idea that is showing up under different names at shops of varying sizes.  I thought I would share a drawing of one to serve as an example.  The basic structure is, of course, a matrix organization with the ability to have each key role present within the project.  It also provides for shared infrastructure services such as support and data.  You could fairly easily replace the Business Analyst (BA) role with a Product Owner / Product Manager role, change “Project” to “Product”, and have a variant of this structure that I have seen implemented at a couple of SaaS providers around Austin.

This structure does assume a level of maturity in the organization as well as in the underlying infrastructure.  It is useful to note that the platform is designated as a “DevOps Platform”.  It would probably be better to describe it as a cloud-type platform – public, private, or hybrid – where the permanence of a particular image is low but the consistency and automation are high.  To be sure, not all environments have built such an infrastructure, but many, if not most, are building them aggressively.  The best time to look at the organization is while those infrastructures are being built and companies are looking for the best ways to exploit them.

Organization with DOMO