There is a management aphorism that “If you can’t measure it, you can’t manage it.” That gets attributed to Peter Drucker, though it is not actually what he said. What he said was “If you can’t measure it, you can’t IMPROVE it”. That’s an important difference as we talk about bringing DevOps and its related practices and disciplines to enterprises.
If you think about it, measuring with the intent to improve something is a much more challenging proposition. Management of something is usually about keeping it within known parameters – maintaining a certain status quo. That is not to imply that Management is not valuable – it is absolutely crucial for maintaining a level of rigor in what is going on. But Improvement deliberately pressures the status quo in order to redefine it at a new point. In fact, redefining the status quo to a better state sounds an awful lot like what we talk about in the DevOps movement.
Improvement always sounds very cool, but there is also an icky truth about Improvement – it is a relative thing. There are no easy answers for questions like:
- ‘What point are we improving to?’
- ‘How do we know when we have improved enough for a while in one area?’
- ‘What is the acceptable rate of progress toward the improved state?’
- and so on…
Those must be answered carefully and the answers must be related to each other. Answering those questions requires something different from Management. It requires Leadership to provide a vision. That brings us to another famous Drucker quote: “Management is doing things right; leadership is doing the right things.”
That quote is a sharp observation, but it does not really judge one as ‘better’ than the other. Leadership is exciting and tends to be much more inspirational at the human level. It therefore usually gets more attention in transitional efforts. However, without the balance of Management, Leadership may not be a sustainable thing in a business situation.
In terms of DevOps and the discipline of Continuous Improvement, the balance of these two things can be articulated with relative clarity. Leadership provides the answers for the hard questions. Management provides the rigor and discipline to maintain steady progress toward the new status quo defined by those answers. Put more simply, Leadership sets forth the goals and Management makes sure we get to those goals.
There is a certain bias in DevOps where we value Leadership – the desire to set and pursue improvement of our daily tech grind. Maybe that is because DevOps is an emergent area that requires a certain fortitude and focus on doing the right things to get it started. And Leadership is certainly good for that. However, I also work with organizations where the well-intended, but unfocused, efforts of leadership-minded people lead to chaos. And those DevOps ‘transformations’ tend to flounder and even make things worse for the people involved. Which is not very DevOps at all.
I have seen enough of these that I have been spending time lately trying to organize my thoughts on the balance point. In the meantime, a piece of advice when you want to pursue a great idea or innovation – figure out how you want to answer the hard questions so you can make the answers stick in your organization and truly reap the benefit of that idea. Then you can get on to the next one, and the next one, and the next one – to achieve the steady improvement of your status quo that is near the heart of DevOps culture.
This article is also on LinkedIn here: https://www.linkedin.com/pulse/management-leadership-continuous-improvement-devops-dan-zentgraf
In order to successfully automate something, the pieces being automated have to be ‘predictable’. I use ‘predictable’ here – rather than ‘consistent’ – deliberately. A ‘predictable’ environment means you can anticipate its state and configuration. ‘Consistent’ gets misconstrued as ‘unchanging’, which is the opposite of what Agile software delivery is trying to achieve.
Consider deploying a fresh build of an application into a test environment. If you cannot predict what the build being deployed looks like and how the environment will be set up, why would you expect to be able to reliably get that build working in that environment within a predictable window of time? And yet, that is exactly what so many teams do.
The proposed solution is usually to automate the deployment. That, however, leads to its own problems if you do not address the predictability of the underlying stuff being automated. I talk to teams with stories about how they abandoned automation because it ‘slowed things down’ or ‘just did not work’. That leads teams to say, and in some cases believe, that their applications are ‘too complex to deploy automatically’.
At the heart of the difficulty in achieving predictability of the code packages and environments is the fact that they are produced by different teams. Somehow it is harder to collaborate with the development or operations team than it is to spend months attempting to build a mountain of hard-to-maintain deployment code. A mountain of code that stands a good chance of being abandoned, by the way. That represents months of wasted time, effort, and life because people working on the same application do not collaborate or cooperate.
And we get another example of why so many DevOps conversations become about culture rather than technology… Which really sucks, because that example is at the expense of a fair bit of pain from the real people on those teams.
The lesson here is that there is no skipping the hard work of establishing predictability in the packaging of the code and environments before charging into automating deployments. We are in an era now where really good packaging and configuration management tools are very mature.
And the next generation of tools that unifies the code and environment changes into immutable, deployable, and promotable artifacts is coming fast. But even with all of these awesome tools, cross-disciplinary experts will have to come together to contribute to the creation of predictable versions of those artifacts.
The ‘C’ in CAMS stands for “Culture” – and a culture of collaboration is exactly what is needed here. There are no shortcuts.
This article is also on LinkedIn here: https://www.linkedin.com/pulse/predictability-predictably-hard-dan-zentgraf/
My father is a computer guy. Mainframes and all of the technologies that were cool a few decades ago. I have early memories of playing with fascinating electro-mechanical stuff at Dad’s office and its datacenter. Printers, plotters, and their last remaining card punch machine in a back corner. Crazy cool stuff for a kid if you have ever seen that gear in action. There’s all kinds of noise and things zipping around.
Now the interesting thing about talking to Dad is that he is seriously geeky about tech. He has always been fascinated by how tech would be applied in the future, and he completely groks the principles and potential of new technology even if he does not really get the specific implementations. Recently he had a problem printing from his iPhone. He had set it up a long time ago and it had worked great. He’s 78 and didn’t bat an eye at connecting his newfangled mobile device to his printer. What was interesting was his behavior when the connection stopped working. He tried mightily to fix the connection definition rather than deleting the configuration and simply recreating it with the wizard. That got me thinking about “fix it” behavior and troubleshooting behavior in IT.
My dad, as an old IT guy, had long experience and training that you fix things when they get out of whack. You certainly didn’t expect to delete a printer definition back in the day – you would edit the file, you would test it, and you would fiddle with it until you got the thing working again. After all, there were relatively few pieces of equipment in the datacenter and offices. That approach makes no sense in a situation where you can simply blow the problematic thing away and let the software automatically recreate it.
And that made me think about DevOps transformations in the enterprise.
I run into so many IT shops where people far younger than my dad struggle mightily to troubleshoot and fix things that could (or should) be easily recreated. To be fair – some troubleshooting is valuable and educational, but a lot is over routine stuff that is either well known, industry standard, or just plain basic. Why isn’t that stuff in an automated configuration management system? Or a VM snapshot? Or a container? Heck – why isn’t it in the Wiki, at least?! And the funny thing is that these shops are using virtualization and cloud technologies already, but treat the virtual artifacts the same way as they did the long-lasting, physical equipment-centric setups of generations past. And that is why so many DevOps conversations come back to culture. Or perhaps ‘habit’ is a better term in this case.
Breaking habits is hard, but we must if we are to move forward. When the old ways do not work for a retired IT guy, you really have to think about why anyone still believes they work in a current technology environment.
This article is on LinkedIn here: https://www.linkedin.com/pulse/old-habits-make-devops-transformation-hard-dan-zentgraf
Feature delivery in a DevOps/Continuous Delivery world is about moving small batches of changes from business idea through to production as quickly and frequently as possible. In a lot of enterprises, however, this proves rather problematic. Most enterprises organize around functional teams rather than value delivery channels. That leaves the business owners or project managers with the quandary of how to shepherd their changes through a mixture of different functional teams in order to get releases done. That shepherding effort means negotiating for things like schedules, resources, and capacity for each release – all of which take time that simply does not exist when using a DevOps approach. That means that the first step in transforming the release process is abandoning the amoebic flexibility of continuous negotiations in favor of something with more consistency and structure – a backbone for continuously delivering changes.
The problem, of course, is where and how to start building the backbone. Companies are highly integrated entities: disrupting one functional area tends to have a follow-on effect on other functions. For example, dedicating resources from shared teams to a specific effort implies backfilling those resources to the other efforts those resources previously supported. That takes time and money that would be better spent elsewhere. This is why automation very quickly comes to the forefront of any DevOps discussion as the driver of the flow. Automation is relatively easy to stand up in parallel to existing processes, it lends itself to incremental enhancement, and it has the intrinsic ability to multiply the efforts of a relatively small number of people. It is relatively easy to prove the ROI of automating business processes – which, really, is why most companies got computers in the first place.
So, if automation is the backbone of a software delivery flow, how do you get to an automated flow? Most organizations follow three basic stages as they add automation to their software delivery flow.
- Continuous Integration to standardized artifacts (bonus points for a Repository)
- Automated Configuration Management (bonus for Provisioning)
- Orchestration of the end-to-end delivery flow
Once a business has decided that it needs more features, sooner, from its software development organization, the development teams will look at Agile practices. Among these practices is Continuous Integration (supporting the principle from the Agile Manifesto that “working software is the primary measure of progress”). This will usually take the form of a relatively frequent build process that produces a consumable form of the code. The frequency of delivery creates a need for consistency in the artifacts produced by the build and a need to track the version progression of those artifacts. This can take the form of naming conventions, directory structures, etc., but eventually will lead to a binary artifact repository.
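As a sketch of what that consistency might look like before a repository is in place, here is a small, hypothetical naming-convention helper. The application name, version scheme, and inputs are all illustrative – any CI server exposes equivalents – and a real repository would formalize this same traceability:

```python
# Hypothetical sketch: a build-artifact naming convention that encodes
# app, version, build number, commit, and build time so that every
# artifact is predictable, sortable, and traceable back to its source.
from datetime import datetime, timezone

def artifact_name(app: str, version: str, build_number: int, commit: str) -> str:
    """Compose a predictable artifact name from CI-provided inputs."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{app}-{version}.{build_number}+{commit[:7]}-{stamp}.tar.gz"

print(artifact_name("webstore", "2.4.1", 118, "9fceb02ab1234"))
```

Even this toy convention lets a human or a script answer "which commit produced this package?" without opening the archive – the property a binary repository later guarantees systematically.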
The availability of frequent updates stimulates demand for more deployments to more consistent test environments, and even to production. The difficulties driven by traditional test lab Configuration Management practices, the drift between test labs and production, and the friction pushing back on iterations cause Development to become a lot more interested in how their code actually gets into the hands of consumers. The sudden attention and spike in workload cause Operations to become very interested in what Development is doing. This is the classic case for why DevOps became such a topic in IT. The now-classic solution for this immediate friction point is to use an automated configuration management system to provide consistent, high-speed enforcement of a known configuration in every environment where the code will run. These systems are highly flexible, model- and script-based, and enable environment changes to be versioned in a way that is very consistent with how developed code is versioned. The configuration management effort will usually evolve to include integration with provisioning systems that enable rapid creation, deletion, or refresh of entire environments. Automated Configuration Management, then, becomes the second piece of the delivery backbone; however, environmental sprawl and a lack of feedback loops quickly show the need for another level.
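The core idea behind those configuration management systems can be sketched in a few lines: declare a desired state, compute the drift, and apply only the difference. This is a toy illustration – the package names and versions are made up, and it is not any particular tool's model:

```python
# Desired state for an environment, declared as data. Because it is plain
# data, it can be versioned right alongside the application code.
desired = {"nginx": "1.24", "openjdk": "17", "redis": "7.2"}

def plan(actual: dict) -> dict:
    """Return the changes needed to converge 'actual' onto 'desired'."""
    changes = {}
    for pkg, version in desired.items():
        if actual.get(pkg) != version:  # missing or wrong version = drift
            changes[pkg] = version
    return changes

# A drifted test environment: redis missing, openjdk outdated.
print(plan({"nginx": "1.24", "openjdk": "11"}))
# -> {'openjdk': '17', 'redis': '7.2'}
```

Real tools add idempotent resources, dependency ordering, and execution, but the declare-compare-converge loop above is the heart of the "known configuration, enforced at high speed" behavior described here.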
The third stage of growing a delivery backbone is Orchestration. It deals with the fact that there are so many pieces to modern application systems that it is inefficient and impractical for humans to track and manage them all in every environment to which those pieces might be deployed. Orchestration systems deal with deliberate sequencing of activities, integration with human workflow, and management of the pipeline of changes in large, complex environments. Tools that deal with this level of activity address the fundamental fact that, even in the face of high-speed automated capabilities, some things have to happen in a particular order. This can be due to technical requirements, but it is often due to fundamental business requirements: coordinating feature improvements with consuming user communities, coordinating with a marketing campaign, or the dreaded regulatory factors. Independent of the reason or need, Orchestration tools allow teams to answer questions of ‘what is in which environment’, ‘are the dependent pieces set’, and ‘what is the actual state of things over there’. In other words, as the delivery backbone grows, Orchestration delivers the precise control over the backbone and enables the organization to capture the full value of the automation.
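A toy sketch of the bookkeeping an orchestration layer provides – the environment names, component names, and version strings here are all hypothetical – might look like:

```python
# Which version of each component is deployed in which environment.
# An orchestration system maintains this state automatically as it
# promotes changes through the pipeline.
pipeline = {
    "dev":  {"web": "2.4.1-118", "api": "1.9.0-77"},
    "test": {"web": "2.4.0-112", "api": "1.9.0-77"},
    "prod": {"web": "2.4.0-112", "api": "1.8.3-70"},
}

def whats_in(env: str) -> dict:
    """Answer 'what is in which environment?' for one environment."""
    return pipeline[env]

def in_sync(env_a: str, env_b: str) -> bool:
    """Answer 'are the dependent pieces set?' between two environments."""
    return pipeline[env_a] == pipeline[env_b]

print(whats_in("test"))          # {'web': '2.4.0-112', 'api': '1.9.0-77'}
print(in_sync("test", "prod"))   # False: the api versions differ
```

Trivial as it is, this is the question-answering capability described above: knowing the actual state of every environment is what makes ordered, business-aware promotion possible.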
These three stages of backbone growth – perhaps better termed ‘evolution’ – may take slightly different forms in different organizations. However, they consistently appear as a natural progression as DevOps thinking becomes pervasive and organizations seek to mature and optimize their delivery flows. Organizations seeking to accelerate their DevOps adoption can use variations on this pattern to deliberately evolve or accelerate the growth of their own delivery backbone.
This article is also on LinkedIn here: https://www.linkedin.com/pulse/grow-delivery-backbone-dan-zentgraf
Computers are far better at keeping records than humans, and good logs are a crucial part of getting value from automating anything. Sometimes this aspect of automation gets lumped in with logging, but there is a difference between recording events and providing traceability. Both have value – the history of what happens in a system is important for reasons ranging from the reactive to the proactive. On the reactive end, this record supports root cause analysis – an understanding of who did what and what happened. Toward the proactive end, the same information can be used to trace how well an automated process is working, identify how it is evolving, and see where it can be improved.
Starting at the basic end of the automation traceability spectrum is the simple concept of access and event logs. The very word ‘traceability’ often calls to mind the idea of auditors, investigators, and even inquisitors seeking to answer the question ‘who did what and when did they do it?’ In some organizations this is a very critical part of the business, and automation is valuable precisely because it makes answering this question much easier. There is no time lost having staff dredge up records and history. The logs are available to be turned into reports by anyone who might be interested. People get on with their work, while those whose job it is to ensure the business meets its regulatory requirements can get on with theirs. It actually is a true win-win, even if it is an awkward topic at times.
The other great reactive value lever of traceability in an automated environment is that it eases root cause analysis when problems occur. No system is perfect, and every system will eventually break down. The automation may even work perfectly, but still let an unforeseen problem escape into production. Good records of what happened facilitate root cause analysis. That saves time and trouble as engineers seek to figure out how to fix the problem at hand and are then tasked with making sure that the problem can never happen again. With good traceability, both sides of the task are less costly and time-consuming. Additionally, the resultant fix is more likely to be effective because there is more and better information available to create it.
Closely related to using traceability for root cause analysis and fixes is the notion of ensuring the automated process’ own health. Is there something going on with the process that could cause it to break down? This is much like a driver noticing that their car is making a new squeaking noise and taking it to the mechanic before major damage is done. The benefit of catching a potential problem early is, of course, that it can be dealt with before it causes an unplanned, costly disruption.
The fourth way that traceability makes automation valuable is that it provides the data required to perform continuous improvement. This notion is about being able to use the data produced by the automation to make something that is working well work better. While ‘better’ may have many definitions depending on the particular context or circumstance being discussed, there can be no structured way of achieving ANY definition of ‘better’ without being able to look at consistent data. And there are few better ways to get consistent data than to have it produced automatically as part of the process on which it is reporting.
Reaching the more proactive end of this spectrum requires time and a consistent effort to mature the tools, automations, and organization. However, traceability of automation builds on itself and is, in fact, the one lever of the three discussed in these posts that has the potential to build progressively more value the longer it is in use, with no clear upper limit. That ability to return progressive value makes it worth the patience and discipline required.
Implicit in DevOps automation is the idea that the ability to make technical changes can be delegated to non-experts. Sure, automation can make an expert more productive, but as I discussed in my last post, the more people who can leverage the automation, the more valuable the automation is. So, the next question is how to effectively delegate the automation so that the largest number of people can leverage it – without breaking things and making others non-productive as a result.
This is a non-trivial undertaking that becomes progressively more complex with the size of the organization and the number of application systems involved. For bonus points, some industries are externally mandated to maintain a separation of duties among the people working on a system. There needs to be a mechanism through which a user can execute an automated process with higher authority than they normally have. Those elevated rights need to last only for the time that execution is running and to limit the ability to affect the environment to an appropriate scope. Look at it this way – continuous delivery to production does not imply giving root to every developer on the team so they can push code. The limits are imposed by what I call a ‘credentials proxy’.
A credentials proxy is simply a mechanism that allows a user to execute a process with privileges that are different, and typically greater than, those they normally have. The classic model for this is the 1986 wonder tool _sudo_. It provides a way for a sysadmin to grant permissions to a user or group of users that enable them to run specific commands as some other user (note – please remember that user does NOT have to be root!!). While sudo’s single system focus makes it a poor direct solution for modern, highly distributed environments, the rules that sudo can model are wonderfully instructive. It’s even pretty smart about handling password changes on the ‘higher-level’ account.
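To make the rule model concrete, here is a small sudoers fragment (the group, host, and account names are hypothetical) granting a deployment group the right to run one specific script as an application account rather than root:

```
# /etc/sudoers.d/deployers -- members of the 'deployers' group may run
# exactly one command, as the 'appsvc' account (NOT root), and only on
# the web farm hosts.
Host_Alias WEBFARM = web01, web02, web03
%deployers WEBFARM = (appsvc) NOPASSWD: /usr/local/bin/deploy-app.sh
```

The rule captures all three dimensions a credentials proxy needs: who may request (the group), what they may run (the single command), and the identity the work actually executes under (the runas account).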
Nearly every delivery automation framework has some notion of this concept. Certainly it is nothing new in the IT automation space – distributed orchestrators have had some notion of ‘execute these commands on those target systems as this account’ for just about as long as those tools have existed. There are even some that simply rely on the remote systems to have sudo… As with most things DevOps, the actual implementation is less important than the broader concept.
The key thing is to have an easily managed way to control things. Custom sudo rules on 500 remote systems are probably not an approach that is going to scale. The solution needs to have three things. First, a way to securely store the higher-permission accounts. Do not skimp here – this is a potential security problem. Next, it needs to be able to authenticate the user making the request. Finally, it needs to have a rules system for mapping the requestors to the automations that they are allowed to execute – whatever form they may take.
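As a sketch of how those three pieces fit together – with the credential store, user tokens, and rule table all stubbed out as plain dictionaries purely for illustration – consider:

```python
# Toy credentials proxy. In a real system the store would be a vault,
# authentication would be SSO/LDAP, and the rules would be managed data.
CREDENTIALS = {"prod-deploy": "s3cret"}        # 1. secure credential store (stub)
USERS = {"alice": "alice-token"}               # 2. authentication data (stub)
RULES = {"alice": {"deploy-webstore"}}         # 3. requestor -> allowed automations

def run_automation(user: str, token: str, automation: str) -> str:
    """Execute an automation with elevated rights, scoped to this one run."""
    if USERS.get(user) != token:
        raise PermissionError("authentication failed")
    if automation not in RULES.get(user, set()):
        raise PermissionError(f"{user} may not run {automation}")
    secret = CREDENTIALS["prod-deploy"]  # elevated credential, never shown to user
    # ... hand 'secret' to the execution engine only for the duration of the run ...
    return f"{automation} executed as prod-deploy for {user}"

print(run_automation("alice", "alice-token", "deploy-webstore"))
```

The important design point is that the requester never sees the elevated credential – they are authorized to trigger a specific automation, and the proxy supplies the privilege only while that automation runs.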
Once the mechanics of the approach are handled and understood, the management doctrine can be established and fine-tuned. The matrix of requesters and automations will grow over time, so all of the typical system issues of user groups and permissions will come into play. The simpler this is, the better off the whole team will be. That said, it needs to be sophisticated enough to convince managers, some of whom may be very invested in expertise silos, that the system is sufficiently controlled to allow the non-experts to do what they need to do. Which is the whole idea of empowering the team members in the first place – give the team what they need and let them do their work.
The basic meaning of democratization (in a non-political sense) is to make something accessible to everyone. This is the core of so much software written today that it is highly ironic how rarely it is systematically applied to the process of actually producing software. However, it is this aspect of automation that is a key _reason_ why automation delivers such throughput benefits. By encapsulating complexity and expertise into something easily consumed, novices can perform tasks in which they are not expert, and do so on demand. In other words, it makes the ‘scarce expertise’ bottleneck irrelevant.
Scaled software environments are now far too complex and involve too many integrated technologies for there to be anyone who really understands all of the pieces at a detailed level. Large scale complexity naturally drives the process of specialization. This has been going on for ages in society at large and there are plenty of studies that describe how we could not really have cities if we did not parcel out all of the basic tasks of planning, running, and supplying the city to many specialists. No one can be an expert in power plants, water plants, sewage treatment plants, and all of the pumps, circuits, pipes, and pieces in them. So, we have specialists.
Specialists, however, create a natural bottleneck. Even in an organization with many experts in something, the fact that the people on the scene are unable to take action means they are waiting – and the people who depend on that group are waiting too. A simple example is unstopping a clogged pipe. Not the world’s most complex issue, but it is a decent lens for illustrating the bottleneck factor. If you don’t know anything about plumbing, you have to call (and wait for) a plumber. Think of the time saved if a plumber was always right next to the drain and could jump right in and unclog that pipe.
Example problems, such as plumbing, that require physical fixes are much harder, of course. In the case of a scaled technology environment, we are fortunate to be able to work with much more malleable stuff – software and software-defined infrastructure. Before we get too excited by that, however, we should remember that, while our environments are far easier to automate than, say, a PVC pipe, we still face the knowledge and tools barrier. And the fact that technology organizations have a lot of people waiting and depending on technology ‘plumbers’ is one of the core drivers of why the DevOps movement is so resonant in the first place.
Consider the situation where developers need an environment in which to build new features for an application system. If developers in that environment can click a button and have a fully operational, representative infrastructure for their application system provisioned and configured in minutes, it is because the knowledge of how to do that has been captured. That means that every time a developer needs to refresh their environment, a big chunk of time is saved by not having to wait on the ‘plumber’ (expert). And that is before taking into account the fact that the removal of that dependency on the expert allows the developer to refresh their environment more frequently – which creates opportunities to enhance quality and productivity. And a similar value proposition exists for testers, demo environments, etc. Even if the automated process is no faster than the expert-driven approach it replaces, the removal of the wait time delivers a massive value proposition.
So, automating value lever number one is ‘power to the people’. Consider that when choosing what to automate first and how much to invest in automating that thing. It doesn’t matter how “cool” or “powerful” a concept is if it doesn’t help the masses in your organization. This should be self-evident, but you still hear people waxing on about how much faster things start in Docker. A few questions later and you figure out that they only actually start those based on one of their ops guys getting an open item through an IT ticketing system from 2003…
Much of the automation discussion in DevOps focuses on speed or, in a more enlightened conversation, throughput. That is unfortunate, because it values a narrow dimension of automation without considering what makes things actually faster. That narrowness means that underlying inefficiencies can be masked by raw power. Too often that results in a short-term win followed by stagnation and an inability to improve further. Sometimes it even creates a win that crumbles after a short period and results in a net loss. All of which translate to a poor set of investment priorities. These next few posts will look at why automation works – the core, simple levers that make it valuable – in an attempt to help people frame their discussions, set their priorities, and make smart investments in automation as they move their DevOps initiatives forward.
To be fair, let’s acknowledge that there is a certain sexiness about speed that can distract from other important conversations. Speed is very marketable – we have seen that for ages in the sports car market. Big horsepower numbers and big top speeds always get big glowing headlines, but more often than not, the car that is the best combined package will do best around a track even if it does not have the highest top speed. As engines and power have become less of a distinguishing factor (there are a surprising number of 200+ mph cars on the market now), people are figuring that out. Many top-flight performance cars have begun to tout their ‘time around the Nürburgring’ (a legendary racetrack in Germany) as a more holistic performance measure rather than just focusing on top speed.
The concept of the complete package being more important is well understood in engineering-driven automobile racing, and winning teams have always used non-speed factors to gain competitive advantage. For example, about 50 years ago, Colin Chapman, the founder of Lotus Cars and designer of World Championship-winning racing cars, famously said that “Adding power makes you faster on the straights. Subtracting weight makes you faster everywhere.” In other words, raw speed and top speed are important, but if you have to slow down to turn, you are likely to be beaten. Currently, Audi dominates endurance racing by running diesel-powered cars that have to make fewer pit stops than the competition. They compete on the principle that even the fastest racecar is very beatable when it is sitting still being fueled.
So, since balance is an important input in achieving the outcomes of speed and throughput, I think we should look at some of the balancing value levers of automation in a DevOps context. In short, a discussion of the more subtle reasons _why_ automation takes center stage when we seek to eliminate organizational silos and how to balance the various factors.
The three non-speed value levers of automation that we will look at across the next few posts are:
1 – Ability to ‘democratize’ expertise
2 – Ability to automate delegation
3 – Traceability
We had an interesting discussion the other day about what made a “good” DevOps tool. The assertion is that a good citizen or good “link” in the toolchain has the same basic attributes regardless of the part of the system for which it is responsible. As it turns out, at least with current best practices, this is a reasonably true assertion. We came up with three basic attributes that the tool had to fit or it would tend to fall out of the toolchain relatively quickly. We got academic and threw ‘popular’ out as a criterion – though supportability and skills availability have to be factors at some point in the real world. Even so, most popular tools are at least reasonably good in our three categories.
Here is how we ended up breaking it down:
- The tool itself must be useful for the domain experts whose area it affects. Whether it be sysadmins worried about configuring OS images automatically, DBAs, network guys, testers, developers, or any of the other potential participants, if the tool does not work for them, they will not adopt it. In practice, specialists will put up with a certain amount of friction if it helps other parts of the team, but once that line is crossed, they will do what they need to do. Even among development teams, where automation is common for CI processes, I STILL see shops where they have a source control system that they use day-to-day and then promote from that into the source control system of record. The latter was only still in the toolchain due to a bureaucratic audit requirement.
- The artifacts the tool produces must be easily versioned. Most often this takes the form of a text-based file that can be easily managed using standard source control practices. That enables artifacts to be quickly referenced and changes between versions tracked over time. Closed systems that have binary version tracking buried somewhere internally are flat-out harder to manage and too often have layers of difficulty associated with comparing versions and other common tasks. Not that it would have to be a text-based artifact per se, but we had a really hard time coming up with tools that produced easily versioned artifacts that did not just use good old text.
- The tool itself must be easy to automate externally. Whether through a simple API or command line, the tool must be easily inserted into the toolchain or even moved around within the toolchain with a minimum of effort. This allows quickest time to value, of course, but it also means that the overall flow can be more easily optimized or applied in new environments with a minimum of fuss.
We got pretty meta, but these three aspects showed up for a wide variety of tools that we knew and loved. The best build tools, the best sysadmin tools, even stuff for databases had these aspects. Sure, this is proof positive that the idea of ‘infrastructure as code’ is still very valid. The above apply nicely to the most basic of modern IDEs producing source code. But the exercise became interesting when we looked at older versus newer tools – particularly the frameworks – and how they approached the problem. Interestingly we felt that some older, but popular, tools did not necessarily pass the test. For example, Hudson/Jenkins are weak on #2 and #3 above. Given their position in the toolchain, it was not clear if it mattered as much or if there was a better alternative, but it was an interesting perspective on what we all regarded as among the best in their space.
This is still an early thought, but I wanted to share it to see what discussion it would stimulate. How we look at tools and toolchains is evolving and maturing. A tool that is well loved by a particular discipline but is a poor toolchain citizen may not be the right answer for the overall organization. A close second that is a better overall fit might be a better answer. But that goes against the common practice of letting practitioners use what they feel is best for their task. What do you do? Who owns that organizational strategic call? We are all going to have to figure that one out as we progress.