We had an interesting discussion the other day about what made a “good” DevOps tool. The assertion is that a good citizen or good “link” in the toolchain has the same basic attributes regardless of the part of the system for which it is responsible. As it turns out, at least with current best practices, this is a reasonably true assertion. We came up with three basic attributes that the tool had to fit or it would tend to fall out of the toolchain relatively quickly. We got academic and threw ‘popular’ out as a criteria – though supportability and skills availability has to be a factor at some point in the real world. Even so, most popular tools are at least reasonably good in our three categories.
Here is how we ended up breaking it down:
- The tool itself must be useful for the domain experts whose area it affects. Whether it be sysadmins worried about configuring OS images automatically, DBAs, network guys, testers, developers or any of the other potential participants, if the tool does not work for them, they will not adopt it. In practice, specialists will put up with a certain amount of friction if it helps other parts of the team, but once that line is crossed, they will do what they need to do. Even among development teams, where automation is common for CI processes, I STILL see shops where they have a source control system that they use day-to-day and then promote from that into the source control system of record. THe latter was only still in the toolchain due to a bureaucratic audit requirement.
- The artifacts the tool produces must be easily versioned. Most often, this takes the form of some form of text-based file that can be easily managed using standard source control practices. That enables them to be quickly referenced and changes among versions tracked over time. Closed systems that have binary version tracking buried somewhere internally are flat-out harder to manage and too often have layers of difficulty associated with comparing versions and other common tasks. Not that it would have to be a text-based artifact per se, but we had a really hard time coming up with tools that produced easily versioned artifacts that did not just use good old text.
- The tool itself must be easy to automate externally. Whether through a simple API or command line, the tool must be easily inserted into the toolchain or even moved around within the toolchain with a minimum of effort. This allows quickest time to value, of course, but it also means that the overall flow can be more easily optimized or applied in new environments with a minimum of fuss.
We got pretty meta, but these three aspects showed up for a wide variety of tools that we knew and loved. The best build tools, the best sysadmin tools, even stuff for databases had these aspects. Sure, this is proof positive that the idea of ‘infrastructure as code’ is still very valid. The above apply nicely to the most basic of modern IDEs producing source code. But the exercise became interesting when we looked at older versus newer tools – particularly the frameworks – and how they approached the problem. Interestingly we felt that some older, but popular, tools did not necessarily pass the test. For example, Hudson/Jenkins are weak on #2 and #3 above. Given their position in the toolchain, it was not clear if it mattered as much or if there was a better alternative, but it was an interesting perspective on what we all regarded as among the best in their space.
This is still an early thought, but I thought I would share the thought to see what discussion it would stimulate. How we look at tools and toolchains is evolving and maturing. A tool that is well loved by a particular discipline but is a poor toolchain citizen may not be the right answer for the overall organization. A close second that is a better overall fit might be a better answer. But, that goes against the common practice of letting the practitioners use what they feel best for their task. What do you do? Who owns that organizational strategic call? We are all going to have to figure that one out as we progress.
The last capability area in the framework is that of Monitoring. I saved this for last because it is the one that tends to be the most difficult to get right. Of course, commensurate with the difficulty is the benefit gained when it is working properly. A lot of the difficulty and benefit with Monitoring comes from the fact that knowing what to look at, when to look at it and what NOT to look at are only the first steps. It also becomes important to know what distributed tidbits of information to bring together if you actually want a complete picture of your application environment.
This post could go for pages – and Monitoring is likely going to be a consuming topic as this series progresses, but for the sake of introduction, lets look at the Monitoring capability area. The sub-capabilities for this area encompass the traditional basics of monitoring Events and Trends among them. The challenge for these two is in figuring out which Events to monitor and sometimes how to get the Event data in the first place. The Trends must then be put into a Report format that resonates with management. It is important to invest in this area in order to build trust with management that the team has control as it tries to increase the frequency of changes – without management’s buy-in, they won’t fund the effort. Finally, the Correlation sub-capability area is related to learning about the application system’s behavior and how changes to some part of the system impacts the other parts. This is an observational knowledge base that must be deliberately built by the team over time so that they can put the Events, Trends, and Reports into the most useful contexts and use the information to better understand risks and priorities when making changes to the system.
The fourth capability area is that of Provisioning. It covers the group of activities for creating all or part of an environment in which an application system can run. This is a key capability for ensuring that application systems have the capacity they need to maintain performance and availability. It is also crucial for ensuring that development and test activities have the capacity they need to maintain THEIR performance. The variance with test teams is that a strong Provisioning capability also ensures that development and test teams can have clean dev/test environments that are very representative of prorduction environments and can very quickly refresh those dev/test environments as needed. The sub-capabilities here deal with managing the consistency of envionment configurations, and then quickly building environments to a known state.
The fifth capability area is closely related to Provisioning. It is the notion of a System Registry capability. This set of capabilities deals with delivering the assumed infrastructure functions (e.g. DNS, e-mail relays, IP ranges, LDAP, etc.) that surround the environments. These capabilities must be managed in such a way that one or more changes to an application system can be added to a new or existing environment with out significant effort or disruption. In many ways this capability area is the fabric in which the others operate. It can also be tricky to get right because this capability area often spans multiple application systems.
The third capability area is that of Deployment. Deployment deals with the act of actually putting the changes into a given target environment. It is not prescritive of how this happens. Many shops mechanically deal with deployment via their provisioning system. That is obviously a good thing and an efficiency gain by removing a discrete system for performing deployment activities. It is really a best practice of the most mature organizations. However, this taxonomy model is about identifying the capabilities needed to consistently apply changes to a whole application system. And, lets face it, best practices tend to be transient; as new, even better, best practices emerge.
Additionally, there are a number of reasons the capability is included in this taxonomy. First of all, the framework is about capabilities rather than technologies or implementations. It is important to be deliberate about how changes are deployed to all environments and simply because some group of those changes are handled by a provisioning tool does not remove the fact that not all are covered nor does it remove the fact that some deliberate work is expended in fitting the changes into the provisioning tool’s structure. Most provisioning tools, for example are set up to handle standard package mechanisms such as RPM. The deployment activity in that scenario is more one of packaging the custom changes. But the provisioning answer is not necessarily a solution for all four core areas of an applpication system, so there needs to be a capability that deals generically with all of them. Finally, many, if not most, shops have some number of systems where there are legacy technical requirements that require deployment to happen separately.
All of that being true, the term “Deployment” is probably confusing given its history and popular use. It will likely be replaced in the third revision of this taxonomy with something more generic, such as “Change Delivery”.
The sub-category of Asset Repository refers to the fact that there needs to be an ability to maintain a collection of changes that can be applied singly or in bulk to a given application system. In the third revision of the taxonomy, it is likely to be joined by a Packaging sub-capability. Comments and thoughts are welcome as this taxonomy is evolving and maturing along with the DevOps movement.
This post covers the first two capability areas in the system taxonomy. This discussion will begin with where the changes come into the “system for changing systems”, Change Management, and proceed around the picture of top-level capability areas.
The first capability area to look at is Change Management. Change is the fundamental reason for this discussion and, in many ways, the discussion is pointless unless this capability is well understood. Put more simply, you can not apply changes if you do not know what the changes are. As a result, this capability area is the change injector for the system. It is where changes to the four components of the application system are identified, labeled and tracked as they are put into place in each environment. For convenience and in recognition of the fact that changes are injected from both the “new feature” angle as well as from the “maintenance item” angle, the two sources of change are each given their own capability sub-area.
The second capability area is that of Orchestration. In a complex system that is maintained by a combination of human and machine-automated prcoesses, understanding what is done, by whom, and in what order is important. This capability area has two sub-areas – one for the technical side and one for the people. This reflects the need to keep the technical dependencies properly managed and also to keep everyone on the same page. Orchestration is a logical extension of the changes themselves. Once you know what the changes are, everyone and everything must stay synchronized on when and where those changes are applied to the application system.
The first step to understanding the framework is to define the broad, top level capability areas. A very common problem in technology is the frequent over-use of terms that can have radically different meanings depending on the context of a conversation. So, as with any effort to clarify the discussion of a topic, it is very critical to define terms and hold to those definitions during the course of the discussion.
At the top level of this framework are six capability groupings
- Change Management – This category is for capabilities that ensure that changes to the system are properly understood and tracked as they happen. This is a massively overused term, but the main idea for this framework is that managing changes is not the same thing as applying them. Other capabilities deal with that. This capability category is all about oversight.
- Orchestration – This category deals with the ability to coordinate activity across different components, areas, and technologies in a complex distributed application system in a synchronized manner
- Deployment – This category covers the activities related to managing the lifecycles of an application systems’ artifacts through the various environments. Put more simply this area deals with the mechanics of actually changing out pieces of an application system.
- Monitoring – The monitoring category deals with instrumenting the environment for various purposes. This instrumentation concept covers all pieces of the application system and provides feedback in the appropriate manner for interested stakeholders. For example, capacity usage for operations and feature usage for development.
- System Registry – This refers to the need for a flexible and well-understood repository of shared information about the infrastructure in which the application system runs. This deals with the services on which the application system depends and which may need to be updated before a new instance of the application system can operate correctly.
- Provisioning – This capability is about creating and allocating the appropriate infrastructure resources for an instance of the application system to run properly. This deals with the number and configuration of those resources. While this area is related to deployment, it is separate because in many infrastructures it may not be desireable or even technically possible to provision fresh resources with each deployment and linking the two would blunt the relevancy of the framework.
The next few posts will dig into the sub-categories underneath each of these top-level items.
In the last couple of posts, we have talked about how application systems need a change application system around them to manage the changes to the application system itself. A “system to manage the system” as it were. We also talked about the multi-part nature of application systems and the fact that the application systems typically run in more than one environment at any given time and will “move” from environment to environment as part of their QA process. These first three posts seek to set a working definition of the thing being changed so that we can proceed to a working definition of a system for managing those changes. This post starts that second part of the series – defining the capabilities of a change application system. This definition will then serve as the base for the third part – pragmatically adopting and applying the capabilities to begin achieving a DevOps mode of operation.
DevOps is a large problem domain with many moving parts. Just within the first set of these posts, we have seen how four rather broad area definitions can multiply substantially in a typical environment. Further, there are aspects of the problem domain that will be prioritized by different stakeholders based on their discipline’s perspective on the problem. The whole point of DevOps, of course, is to eliminate that perspective bias. So, it becomes very important to have some method for unifying the understanding and discussion of the organizations’ capabilities. In the final analysis, it is not as important what that unified picture looks like as it is that the picture be clearly understood by all.
To that end, I have put together a framework that I use with my customers to help in the process of understanding their current state and prioritizing their improvement efforts. I initially presented this framework at the Innovate 2012 conference and subsequently published an introductory whitepaper on the IBM developerWorks website. My intent with these posts is to expand the discussion and, hopefully, help folks get better faster. The interesting thing to me is to see folks adopt this either as is or as the seed of something of their own. Either way, it has been gratifying to see folks respond to it in its nascent form and I think the only way for it to get better is to get more eyeballs on it.
So, here is my picture of the top-level of the capability areas (tools and processes) an organization needs to have to deliver changes to an application system.
The quality and maturity of these within the organization will vary based on their business needs – particularly around formality – and the frequency with which they need to apply changes.
I applied three principles when I put this together:
- The capabilities had to be things that exist in all environments that application system runs (ie dev, test, prod, or whatever layers exist). THe idea here is that such a perspective will help unify tooling and approaches to a theoretical ideal of one solution for all environments.
- The capabilities had to be broad enough to allow for different levels of priority / formality depending on the environment. The idea is to not burden a more volatile test environment with production-grade formality or vice-versa. But to allow a structured discussion of how the team will deliver that capability in a unified way to the various environments. DevOps is an Agile concept, so the notion of minimally necessary applies.
- The capabilities had to be generic enough to apply to any technology stack that an organization might have. Larger organizations may need multiple solutions based on the fact that they have many application systems that were created at different points in time, in different languages, and in different architectures. It may not be possible to use exactly the same tool / process in all of those environments, but it most certainly is possible to maintain a common understanding and vocabulary about it.
In the next couple of posts, I will drill a bit deeper into the capability areas to apply some scope, focus, and meaning.
As the DevOps movement rolls on, there is a pattern emerging. Some efforts are initiated by development, seeking relief on test environment management. Others are initiated by operations departments trying to get more automation and instrumentation into the environments they manage. I frequently hear comments that are variations on “same stuff I’ve been doing for xx years, different title” from people who have DevOps in their job title or job description. Their shops are hoping that if they encourage folks to think about DevOps and maybe try some new tools, they will get the improvements promised by DevOps discussions online. Well, just like buying a Ferrari and talking it up won’t make you Michael Schumacher, having Puppet or Chef to do your configuration management won’t “make you DevOps” (whatever that means). Successful DevOps shops are bypassing the window dressing and going at DevOps as a project unto itself.
There are a number of unique aspects to undertaking a project such as this. They require a holistic perspective on the process, touch a very broad range of activities, and provide an approach for changing other systems while being constantly changed themselves.
These projects are unique in the software organization because they require a team to look at the whole end-to-end approach to delivering changes to the application systems within that organization FROM THE SIDE; rather than from a position somewhere in the middle of the process. This is an important difference in approach, because it forces a change in perspective on the problem. Typically, someone looking from either the development or the operations “end” of the process will often suffer from a perceptive problem where the “closer” problems in the process look bigger than the ones “farther” up or down the process line. It is a very human thing to be deceived by the perspective of our current position. After all, there are countless examples of using perspective for optical illusions. Clever Leaning Tower of Pisa pictures (where someone appears to be holding it up) and the entire Lord of the Rings movie trilogy (the actors playing the hobbits are not that short) provide easy examples. Narrowness of perspective is, in fact, a frequent reason that “grassroots” efforts’ fail outside of small teams. Successfully making large and impactful changes requires a broader perspective.
The other breadth-related aspect of these programs is that they touch a very wide range of activities over time and seek to optimize for flow both through and among each. That means that they have some similarities with supply chain optimization and ERP projects if not in scale, then in complexity. And the skills to look at those flows probably do not exist directly within the software organization, but in the business units themselves. It can be difficult for technology teams, that see themselves as critical suppliers of technology to business units, to accept that there are large lessons to be learned about technology development from the business units. It takes a desire to learn and change at a level well above a typical project.
A final unique part is that there must be ongoing programs for building and enhancing a system for managing consistent and ongoing changes in other systems. Depending on your technology preference, there are plenty of analogies from pipelines, powergrids and aircraft that apply here. Famous and fun ones are the flight control systems of intrinsically unstable aircraft such as the F-16 fighter or B-2 bomber. These planes use technology to adjust control surfaces within fractions of a second to maintain steady and controled flight within the extreme conditions faced by combat aircraft. Compared to that, delivering enhancements to a release automation system every few weeks sounds trivial, but maintaining the discipline and control to do so in a large organization can be a daunting task.
So the message here is to deliberately establish a program to manage how changes are applied. Accept that it is going to be a new and unusual thing in your organization and that it is going to require steady support and effort to be successful. Without that acceptance, it will likely not work out.
My next few posts are going to dig into this deeper and begin looking at the common aspects of these programs, how people approach the problem, how they organize and prioritize their efforts, and the types of tools they are using.
The company I work for serves many large corporations in our customer base, many of whom are IBM shops with the commensurately large WebSphere installed bases. So, as you might imagine, it behooves us to keep abreast of the latest stuff IBM delivers.
We are fortunate enough to be pretty good at what we do and are in the premiere tier of IBM’s partner hierarchy and were recently able to get an IBM Workload Deployer (IWD) appliance in as an evaluation unit. If you are not familiar, the IWD is really the third revision of the appliance formerly known as the IBM WebSphere Cloudburst appliance. I do not know, but I would presume the rebrand is related to the fact that the IWD is handling more generic workloads than simply those related to WebSphere and therefore deserved a more general name.
You can read the full marketing rundown on the IBM website here: IBM Workload Deployer
This is a “cloud management in a box” solution that you drop onto your network, point at one one or more of the supported hypervisors, and it handles images, load allocation, provisioning etc. You can give it simple images to manage, but the thing really lights up when you give it “Patterns” – a term which translates to a full application infrastructure (balancing webservers, middleware, DB, etc.). If you use this setup, the IWD will let you manage your application as a single entity and maintain the connections for you.
I am not an expert on the thing – at least not yet, but a couple of other points that immediately jump out at me are:
- The thing also has a pretty rich Python-based command line client that should allow us to do some smart script stuff and maintain those in a proper source repository.
- The patterns and resources also have intelligence in them where you can’t break dependencies of a published configuration
- There are a number of pre-cooked template images that don’t seem very locked down that you can use as starter points for customization or you can roll your own.
- The Rational Automation Framework tool is here, too, so that brings up some migration possibilities for folks looking to bring apps either into a ‘cloud’ or a better managed virtual situation