The “Other” Value Levers of Automation – Part 4 – Traceability

Computers are far better at keeping records than humans, and good logs are a crucial part of getting value from automating anything. Sometimes this aspect of automation gets lumped in with logging, but there is a difference between recording events and providing traceability. Both have value – the history of what happens in a system is important for reasons ranging from the reactive to the proactive. On the reactive end, this record supports root cause analysis – an understanding of who did what and what happened. Toward the proactive end, the same information can be used to trace how well an automated process is working, identify how it is evolving, and find where it can be improved.

Starting at the basic end of the automation traceability spectrum is the simple concept of access and event logs. The very word ‘traceability’ often calls to mind the idea of auditors, investigators, and even inquisitors seeking to answer the question ‘who did what and when did they do it?’ In some organizations this is a very critical part of the business and is a valuable part of automation because it makes answering this question much easier. There is no time lost by having staff dredge up records and history. The logs are available to be turned into reports by anyone who might be interested. Productivity is saved by letting people get on with their work, while those whose work it is to ensure the business meets its regulatory requirements can also get on with theirs. It actually is a true win-win, even if it is an awkward topic at times.
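To make the ‘who did what and when’ idea concrete, here is a minimal sketch in Python. It assumes an in-memory list stands in for a real log store, and the function names (`record_event`, `events_by_actor`) and the sample actors and targets are made up for illustration; they do not come from any particular tool.

```python
import json
from datetime import datetime, timezone


def record_event(log, actor, action, target, when=None):
    """Append one 'who did what, and when' entry to the log."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "when": when.isoformat(),
        "actor": actor,
        "action": action,
        "target": target,
    }
    # One JSON object per entry keeps the log easy to turn into reports later.
    log.append(json.dumps(entry, sort_keys=True))
    return entry


def events_by_actor(log, actor):
    """Answer the auditor's question for one person or service account."""
    return [e for e in map(json.loads, log) if e["actor"] == actor]


# Hypothetical events; in practice the automation itself would emit these.
audit_log = []
record_event(audit_log, "alice", "deploy", "billing-service 1.4.2")
record_event(audit_log, "bob", "restart", "web-01")
record_event(audit_log, "alice", "rollback", "billing-service 1.4.1")

for event in events_by_actor(audit_log, "alice"):
    print(event["when"], event["action"], event["target"])
```

The point of the sketch is that the automation writes the record as a side effect of doing the work, so nobody has to reconstruct it afterwards.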

The other great reactive value lever of traceability in an automated environment is that it eases root cause analysis when problems occur. No system is perfect and they will always break down. The automation may even work perfectly, but still let an unforeseen problem escape into production. Good records of what happened facilitate root cause analysis. That saves time and trouble as engineers seek to figure out how to fix the problem at hand and are then tasked with making sure that the problem can never happen again. With good traceability, both sides of the task are less costly and time-consuming. Additionally, the resultant fix is more likely to be effective because there is more and better information available to create it.

Closely related to using traceability for root cause analysis and fixes is the notion of ensuring the automated process’ own health. Is there something going on with the process that could cause it to break down? This is much like a driver noticing that their car is making a new squeaking noise and taking it to the mechanic before major damage is done. The benefit of catching a potential problem early is, of course, that it can be dealt with before it causes an unplanned, costly disruption.
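One way to listen for that ‘new squeaking noise’ is to compare recent run durations against a baseline taken from the automation’s own history. The sketch below is only an illustration under simple assumptions: the durations are invented, the function name `looks_unhealthy` is hypothetical, and the three-standard-deviation threshold is an arbitrary starting point rather than a recommendation.

```python
from statistics import mean, stdev


def looks_unhealthy(baseline, recent, sigmas=3.0):
    """Return True if the recent average duration sits well above the baseline."""
    if len(baseline) < 2 or not recent:
        return False  # not enough history to judge either way
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return mean(recent) > threshold


# Hypothetical run durations in seconds, as recorded by the automation.
baseline_seconds = [61, 59, 60, 62, 58, 60, 61, 59]
steady = [60, 61, 59]
creeping = [70, 74, 79]

print(looks_unhealthy(baseline_seconds, steady))
print(looks_unhealthy(baseline_seconds, creeping))
```

With these sample numbers the steady runs pass and the creeping runs are flagged, which is the cue to look at the process before it fails outright.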

The fourth way that traceability makes automation valuable is that it provides the data required to perform continuous improvement. This notion is about being able to use the data produced by the automation to make something that is working well work better. While ‘better’ may have many definitions depending on the particular context or circumstance being discussed, there can be no structured way of achieving ANY definition of ‘better’ without being able to look at consistent data. And there are few better ways to get consistent data than to have it produced automatically as part of the process on which it is reporting.

Reaching the more proactive end of this spectrum requires time and a consistent effort to mature the tools, automations, and organization. However, traceability of automation builds on itself and is, in fact, the one of the three levers discussed in these posts that has the potential to build progressively more value the longer it is in use with no clear upper limit. That ability to return progressive value makes it worth the patience and discipline required.

The “Other” Value Levers of Automation – Part 3 – Everybody is Empowered

Implicit in DevOps automation is the idea that the decision to make technical changes should be delegated to non-experts in the first place. Sure, automation can make an expert more productive, but as I discussed in my last post, the more people who can leverage the automation, the more valuable the automation is. So, the next question is how to effectively delegate the automation so that the largest number of people can leverage it – without breaking things and making others non-productive as a result.

This is a non-trivial undertaking that becomes progressively more complex with the size of the organization and the number of application systems involved. For bonus points, some industries are externally mandated to maintain a separation of duties among the people working on a system. There needs to be a mechanism through which a user can execute an automated process with higher authority than they normally have. Those elevated rights need to last only for the time when that execution is running, and they need to limit the ability to affect the environment to a scope that is appropriate. Look at it this way – continuous delivery to production does not imply giving root to every developer on the team so they can push code. There are limits imposed by what I call a ‘credentials proxy’.

A credentials proxy is simply a mechanism that allows a user to execute a process with privileges that are different, and typically greater than, those they normally have. The classic model for this is the 1986 wonder tool _sudo_. It provides a way for a sysadmin to grant permissions to a user or group of users that enable them to run specific commands as some other user (note – please remember that user does NOT have to be root!!). While sudo’s single system focus makes it a poor direct solution for modern, highly distributed environments, the rules that sudo can model are wonderfully instructive. It’s even pretty smart about handling password changes on the ‘higher-level’ account.

Nearly every delivery automation framework has some notion of this concept. Certainly it is nothing new in the IT automation space – distributed orchestrators have had some notion of ‘execute these commands on those target systems as this account’ for just about as long as those tools have existed. There are even some that simply rely on the remote systems to have sudo… As with most things DevOps, the actual implementation is less important than the broader concept.

The key thing is to have an easily managed way to control things. Custom sudo rules on 500 remote systems is probably not an approach that is going to scale. The solution needs to have 3 things. First, a way to securely store the higher permission accounts. Do not skimp here – this is a potential security problem. Next, it needs to be able to authenticate the user making the request. Finally, it needs to have a rules system for mapping the requestors to the automations that they are allowed to execute – whatever form they may take.
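The three pieces can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the user, groups, automation names, and service accounts are invented, plain dictionaries take the place of a real secret store and directory service, and the bare SHA-256 token check is only a placeholder for whatever authentication the organization actually uses.

```python
import hashlib
import hmac

# Hypothetical data; a real system would keep these in a secret store
# and a directory service, not in source code.
USER_TOKEN_HASHES = {"dana": hashlib.sha256(b"dana-token").hexdigest()}
USER_GROUPS = {"dana": {"developers"}}

# Rules: which groups may run which automations.
RULES = {
    "deploy-to-test": {"developers", "release-managers"},
    "deploy-to-prod": {"release-managers"},
}

# The higher-permission account each automation runs as.
SERVICE_ACCOUNTS = {
    "deploy-to-test": "svc-test-deployer",
    "deploy-to-prod": "svc-prod-deployer",
}


def authenticate(user, token):
    """Check the requester's token using a constant-time comparison."""
    expected = USER_TOKEN_HASHES.get(user)
    if expected is None:
        return False
    supplied = hashlib.sha256(token.encode()).hexdigest()
    return hmac.compare_digest(supplied, expected)


def is_allowed(user, automation):
    """True if any of the user's groups is mapped to the automation."""
    return bool(USER_GROUPS.get(user, set()) & RULES.get(automation, set()))


def run_as_proxy(user, token, automation, action):
    """Run `action` under the automation's account; the user never sees it."""
    if not authenticate(user, token):
        raise PermissionError("authentication failed")
    if not is_allowed(user, automation):
        raise PermissionError(f"{user} may not run {automation}")
    account = SERVICE_ACCOUNTS[automation]
    return action(account)  # the elevated account is used only for this call


result = run_as_proxy(
    "dana", "dana-token", "deploy-to-test",
    lambda account: f"deployed as {account}",
)
print(result)
```

The same request against `deploy-to-prod` raises `PermissionError`, because the hypothetical `developers` group is not mapped to that automation. Keeping the rules in one central place is what lets this scale where per-system sudo rules would not.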

Once the mechanics of the approach are handled and understood, the management doctrine can be established and fine tuned. The matrix of requesters and automations will grow over time, so all of the typical system issues of user groups and permissions will come into play. The simpler this is, the better off the whole team will be. That said, it needs to be sophisticated enough to enable managers, some of whom may be very invested in expertise silos, to understand that the system is sufficiently controlled to allow the non-experts to do what they need to do. Which is the whole idea of empowering the team members in the first place – give the team what they need and let them do their work.

The “Other” Value Levers of Automation – Part 2 – Democratizing Expertise

The basic meaning of democratization (in a non-political sense) is to make something accessible to everyone. This is the core of so much software that is written today that it is highly ironic how rarely it is systematically applied to the process of actually producing software. However, it is this aspect of automation that is a key _reason_ why automation delivers such throughput benefits. By encapsulating complexity and expertise into something easily consumed, novices can perform tasks in which they are not expert and do so on demand. In other words, the ‘scarce expertise’ bottleneck can be made irrelevant.

Scaled software environments are now far too complex and involve too many integrated technologies for there to be anyone who really understands all of the pieces at a detailed level. Large scale complexity naturally drives the process of specialization. This has been going on for ages in society at large and there are plenty of studies that describe how we could not really have cities if we did not parcel out all of the basic tasks of planning, running, and supplying the city to many specialists. No one can be an expert in power plants, water plants, sewage treatment plants, and all of the pumps, circuits, pipes, and pieces in them. So, we have specialists.

Specialists, however, create a natural bottleneck. Even in a large organization where you have many experts in something, the fact that the people on the scene are unable to take action means they are waiting, and the people who depend on that group are waiting too. A simple example is unstopping a clogged pipe. Not the world’s most complex issue, but it is a decent lens for illustrating the bottleneck factor. If you don’t know anything about plumbing, then you have to call (and wait for) a plumber. Think of the time saved if a plumber were always right next to the drain and could jump right in and unclog that pipe.

Example problems, such as plumbing, that require physical fixes are much harder, of course. In the case of a scaled technology environment, we are fortunate to be able to work with much more malleable stuff – software and software-defined infrastructure. Before we get too excited by that, however, we should remember that, while our environments are far easier to automate than, say, a PVC pipe, we still face the knowledge and tools barrier. And the fact that technology organizations have a lot of people waiting and depending on technology ‘plumbers’ is one of the core drivers of why the DevOps movement is so resonant in the first place.

Consider the situation where developers need an environment in which to build new features for an application system. If developers in that environment can click a button and have a fully operational, representative infrastructure for their application system provisioned and configured in minutes, it is because the knowledge of how to do that has been captured. That means that every time a developer needs to refresh their environment, a big chunk of time is saved by not having to wait on the ‘plumber’ (expert). And that is before taking into account the fact that the removal of that dependency on the expert allows the developer to more frequently refresh their environment – which creates opportunities to enhance quality and productivity. And a similar value proposition exists for testers, demo environments, etc. Even if the automated process is no faster than the expert-driven approach it replaces, the removal of the wait time delivers a massive value proposition.

So, automation value lever number one is ‘power to the people’. Consider that when choosing what to automate first and how much to invest in automating that thing. It doesn’t matter how “cool” or “powerful” a concept is if it doesn’t help the masses in your organization. This should be self-evident, but you still hear people waxing on about how much faster things start in Docker. A few questions later and you figure out that they only actually start those containers when one of their ops guys gets an open item through an IT ticketing system from 2003…