Old Habits Make DevOps Transformation Hard

My father is a computer guy. Mainframes and all of the technologies that were cool a few decades ago. I have early memories of playing with fascinating electro-mechanical stuff at Dad’s office and its datacenter. Printers, plotters, and their last remaining card punch machine in a back corner. Crazy cool stuff for a kid if you have ever seen that gear in action. There’s all kinds of noise and things zipping around.

Now the interesting thing about talking to Dad is that he is seriously geeky about tech. Always fascinated by the future of how tech would be applied and he completely groks the principals and potentials of new technology even if he does not really get the specific implementations. Recently he had a problem printing from his iPhone. He had set it up a long time ago and it worked great. He’s 78 and didn’t bat an eye at connecting his newfangled mobile device to his printer. What was interesting was his behavior when the connection stopped working. He tried mightily to fix the connection definition rather than deleting the configuration and simply recreating it with the wizard. That got me thinking about “fix it” behavior and troubleshooting behavior in IT.

My dad, as an old IT guy, had long experience and training that you fix things when they got out of whack. You certainly didn’t expect to delete a printer definition back in the day – you would edit the file, you would test it, and you would fiddle with it until you got the thing working again. After all, you had just the relatively few pieces of equipment in the datacenter and offices. That makes no sense in a situation where you can simply blow the problematic thing away and let the software automatically recreate it.

And that made me think about DevOps transformations in the enterprise.

I run into so many IT shops where people far younger than my dad struggle mightily to troubleshoot and fix things that could (or should) be easily recreated. To be fair – some troubleshooting is valuable and educational, but a lot is over routine stuff that is either well known, industry standard, or just plain basic. Why isn’t that stuff in an automated configuration management system? Or a VM snapshot? Or a container? Heck – why isn’t it in the Wiki, at least?! And the funny thing is that these shops are using virtualization and cloud technologies already, but treat the virtual artifacts the same way as they did the long-lasting, physical equipment-centric setups of generations past. And that is why so many DevOps conversations come back to culture. Or perhaps ‘habit’ is a better term in this case.

Breaking habits is hard, but we must if we are to move forward. When the old ways do not work for a retired IT guy, you really have to think about why anyone still believes they work in a current technology environment.

This article is on LinkedIn here: https://www.linkedin.com/pulse/old-habits-make-devops-transformation-hard-dan-zentgraf

What Makes a Good DevOps Tool?

We had an interesting discussion the other day about what made a “good” DevOps tool.  The assertion is that a good citizen or good “link” in the toolchain has the same basic attributes regardless of the part of the system for which it is responsible. As it turns out, at least with current best practices, this is a reasonably true assertion.  We came up with three basic attributes that the tool had to fit or it would tend to fall out of the toolchain relatively quickly. We got academic and threw ‘popular’ out as a criteria – though supportability and skills availability has to be a factor at some point in the real world. Even so, most popular tools are at least reasonably good in our three categories.

Here is how we ended up breaking it down:

  1. The tool itself must be useful for the domain experts whose area it affects.  Whether it be sysadmins worried about configuring OS images automatically, DBAs, network guys, testers, developers or any of the other potential participants, if the tool does not work for them, they will not adopt it.  In practice, specialists will put up with a certain amount of friction if it helps other parts of the team, but once that line is crossed, they will do what they need to do.  Even among development teams, where automation is common for CI processes, I STILL see shops where they have a source control system that they use day-to-day and then promote from that into the source control system of record.  THe latter was only still in the toolchain due to a bureaucratic audit requirement.
  2. The artifacts the tool produces must be easily versioned.  Most often, this takes the form of some form of text-based file that can be easily managed using standard source control practices. That enables them to be quickly referenced and changes among versions tracked over time. Closed systems that have binary version tracking buried somewhere internally are flat-out harder to manage and too often have layers of difficulty associated with comparing versions and other common tasks. Not that it would have to be a text-based artifact per se, but we had a really hard time coming up with tools that produced easily versioned artifacts that did not just use good old text.
  3. The tool itself must be easy to automate externally.  Whether through a simple API or command line, the tool must be easily inserted into the toolchain or even moved around within the toolchain with a minimum of effort. This allows quickest time to value, of course, but it also means that the overall flow can be more easily optimized or applied in new environments with a minimum of fuss.

We got pretty meta, but these three aspects showed up for a wide variety of tools that we knew and loved. The best build tools, the best sysadmin tools, even stuff for databases had these aspects. Sure, this is proof positive that the idea of ‘infrastructure as code’ is still very valid. The above apply nicely to the most basic of modern IDEs producing source code. But the exercise became interesting when we looked at older versus newer tools – particularly the frameworks – and how they approached the problem. Interestingly we felt that some older, but popular, tools did not necessarily pass the test.  For example, Hudson/Jenkins are weak on #2 and #3 above.  Given their position in the toolchain, it was not clear if it mattered as much or if there was a better alternative, but it was an interesting perspective on what we all regarded as among the best in their space.

This is still an early thought, but I thought I would share the thought to see what discussion it would stimulate. How we look at tools and toolchains is evolving and maturing. A tool that is well loved by a particular discipline but is a poor toolchain citizen may not be the right answer for the overall organization. A close second that is a better overall fit might be a better answer. But, that goes against the common practice of letting the practitioners use what they feel best for their task. What do you do? Who owns that organizational strategic call? We are all going to have to figure that one out as we progress.

Slides from Agile Austin Talk 2-12-2013

This post slides from my talk at Agile Austin on February 12, 2013.

I want to thank the group for the opportunity and the audience for the great interaction!

Link to slides:  DevOps Beyond the Basics – FINAL

A System for Changing Systems – Part 3 – How Many “Chang-ee”s

As mentioned in the last post, once there is a “whole system” understanding of an application system, the next problem is that there are really multiple variants of that system running within the organization at any given time. There are notionally at least three: Development, Test, and Production. In reality, however, most shops frequently have multiple levels of test and potentially more than one Development variant. Some even have Staging or “Pre-production” areas very late in test where the modified system must run for some period before finally replacing the production environment. A lot of this environment proliferation is based on historic processes that are themselves a product of the available tooling and lessons organizations have learned over years of delivering software.

Example Environment Flow

This is a simplified, real-world example flow through some typical environments. Note the potential variable paths – another reason to know what configuration is being tested.

Tooling and processes are constantly evolving. The DevOps movement is really a reflection of the mainstreaming of Agile approaches and cloud-related technologies and is ultimately a discussion of how to best exploit it. That discussion, as it applies to environment proliferation, means we need to get to an understanding of the core problems we are trying to solve. The two main problem areas are maintaining the validity of the sub-production environments as representative of production and tracking the groupings of changes to the system in each of the environments.

The first problem area, that of maintaining the validity of sub-production envrionments, is a more complex problem than it would seem. There are organizational silo problems where multiple different groups own the different environments. For example, a QA group may own the lab configuraitons and therefore have a disconnect relative to the production team. There are also multipliers associated with technical specialities, such as DBAs or Network Administration, which may be shared across some levels of environment. And if the complexity of the organization was not enough, there are other issues associated with teams that do not get along well, the business’ perception that test environments are less critical than production, and other organizational dynamics that make it that much more difficult to ensure good testing regimes are part of the process.

The second key problem area that must be addresssed is tracking the groups of changes to the application system that are being evaluated in a particular sub-production environment. This means having a unique identifier for the combination of application code, the database schema and dataset, system configuration, and network configuration. That translates to five version markers – one for each of the main areas of the application system plus one for the particular combination of all four. On the surface, this is straightforward, but in most shops, there are few facilities for tracking versions of configurations outside of software code. Even when they are, they are too often not connected to one another for tracking groupings of configurations.

They typical pattern for solving these two problems actually begins with the second problem first. It is difficult to ensure the validity of a test environment if there is no easy way to identify and understand the configuration of the components involved. This is why many DevOps initiatives start with configuration management tools such as Puppet, Chef, or VMWare VCenter. It is also why “all-in-one” solutions such as IBM’s Pure family are starting to enter the market. Once an organization can get a handle on their configurations, then it is substantially easier to have fact-based engineering conversations about valid test configurations and environments because everyone involved has a clear reference for understanding exactly what is being discussed.

This problem discussion glosses over the important aspect of being able to maintain these tools and environments over time. Consistently applying the groups of changes to the various environments requires a complex system by itself. The term system is most appropirate because the needed capabilities go well beyond the scope of a single tool and then those capabilities need to be available for each of the system components. Any discussion of such broad capabilities is well beyond the scope of a single blog post, so the next several posts in this series will look at framework for understanding the capabilities needed for such a system.

A System for Changing Systems – Part 2 – The “Chang-ee”

As discussed last time, having a clear understanding of the thing being changed is key to understanding how to change it. Given that, this post will focus on creating a common framework for understanding the “Change-ee” systems. To be clear, the primary subject of this discussion are software application systems. That should be obvious from the DevOps discussion, but I prefer not to assume things.

Application systems generally have four main types of components. First, and most obviously, is the software code. That is often referred to as the “application”. However, as the DevOps movement has long held, that is a rather narrow definition of things. The software code can not run by itself in a standalone vacuum. That is why these posts refer to an application *system* rather than just an application. The other three parts of the equation are the database, the server infrastructure and the network insfrastructure. It takes all four of these areas working together for an application system to function.

Since these four areas will frame the discussion going forward, we need to have a common understanding about what is in each. It is important to understand that there are variants of each of these components as changes are applied and qualified for use in the production environment. In other words, there will be sub-production environments that have to have representative configurations. And those have to be considered when deciding how to apply changes through the environment.

  • Application Code – This is the set of functionality defined by the business case that justifies the existance of the application system in the first place and consists of the artifacts created by the development team for the solution including things such as server code, user interface artifacts, business rules, etc.
  • Database & Data – This is the data structure required for the application to run. This area includes all data-related artifacts, whether they are associated with a traditional RDBMS, “no sql” system, or just flat files. This includes data, data definition structures (eg schema), test datasets, and so forth.
  • Server Infrastructure (OS, VM, Middleware, Storage) – This represents the services and libraries required for the application to run. A broad category ranging from the VM/OS layer all the way through the various middleware layers and libraries on which the application depends. This area also includes storage for the database area.
  • Network Infrastructure – This category is for all of the inter-system communications components and links required for users to derive value from the application system. This includes the connectivity to the users, connectivity among servers, connectivity to resources (e.g. storage), and the devices (e.g. load balancers, routers, etc.) that enable the application system to meet its functional, performance, and availability requirements
Application System Components

Conceptual image of the main system component areas that need to be in sync in order for a system to operate correctly

The complicating factor for these four areas is that there are multiple instances of each of them that exist in an organization at any given time. And those multiple instances may be at different revision levels. Dealing with that is a discussion unto itself, but is no less critical to understanding the requirements for a system to manage your application system. The next post will examine this aspect of things and the challenges associated with it.

DevOps is about Developing a Change Application System

As the DevOps movement rolls on, there is a pattern emerging. Some efforts are initiated by development, seeking relief on test environment management. Others are initiated by operations departments trying to get more automation and instrumentation into the environments they manage. I frequently hear comments that are variations on “same stuff I’ve been doing for xx years, different title” from people who have DevOps in their job title or job description. Their shops are hoping that if they encourage folks to think about DevOps and maybe try some new tools, they will get the improvements promised by DevOps discussions online. Well, just like buying a Ferrari and talking it up won’t make you Michael Schumacher, having Puppet or Chef to do your configuration management won’t “make you DevOps” (whatever that means). Successful DevOps shops are bypassing the window dressing and going at DevOps as a project unto itself.

There are a number of unique aspects to undertaking a project such as this. They require a holistic perspective on the process, touch a very broad range of activities, and provide an approach for changing other systems while being constantly changed themselves.

These projects are unique in the software organization because they require a team to look at the whole end-to-end approach to delivering changes to the application systems within that organization FROM THE SIDE; rather than from a position somewhere in the middle of the process. This is an important difference in approach, because it forces a change in perspective on the problem. Typically, someone looking from either the development or the operations “end” of the process will often suffer from a perceptive problem where the “closer” problems in the process look bigger than the ones “farther” up or down the process line. It is a very human thing to be deceived by the perspective of our current position. After all, there are countless examples of using perspective for optical illusions. Clever Leaning Tower of Pisa pictures (where someone appears to be holding it up) and the entire Lord of the Rings movie trilogy (the actors playing the hobbits are not that short) provide easy examples. Narrowness of perspective is, in fact, a frequent reason that “grassroots” efforts’ fail outside of small teams. Successfully making large and impactful changes requires a broader perspective.

The other breadth-related aspect of these programs is that they touch a very wide range of activities over time and seek to optimize for flow both through and among each. That means that they have some similarities with supply chain optimization and ERP projects if not in scale, then in complexity. And the skills to look at those flows probably do not exist directly within the software organization, but in the business units themselves. It can be difficult for technology teams, that see themselves as critical suppliers of technology to business units, to accept that there are large lessons to be learned about technology development from the business units. It takes a desire to learn and change at a level well above a typical project.

A final unique part is that there must be ongoing programs for building and enhancing a system for managing consistent and ongoing changes in other systems. Depending on your technology preference, there are plenty of analogies from pipelines, powergrids and aircraft that apply here. Famous and fun ones are the flight control systems of intrinsically unstable aircraft such as the F-16 fighter or B-2 bomber. These planes use technology to adjust control surfaces within fractions of a second to maintain steady and controled flight within the extreme conditions faced by combat aircraft. Compared to that, delivering enhancements to a release automation system every few weeks sounds trivial, but maintaining the discipline and control to do so in a large organization can be a daunting task.

So the message here is to deliberately establish a program to manage how changes are applied. Accept that it is going to be a new and unusual thing in your organization and that it is going to require steady support and effort to be successful. Without that acceptance, it will likely not work out.

My next few posts are going to dig into this deeper and begin looking at the common aspects of these programs, how people approach the problem, how they organize and prioritize their efforts, and the types of tools they are using.

Another Example of Grinding Mental Gears

I recently got a question from a customer who was struggling with the ‘availability’ of their sub-production environments. The situation brought into focus a fundamental disconnect between the Ops folks who were trying to maintain a solid set of QA environments for the Dev team and what the Dev teams needed. To a large extend this is a classic DevOps dilemma, but the question provides an excellent teaching moment. Classic application or system availability as defined for a production situation does not really apply to Dev or multi-level Test environments.

Look at it this way. End user productivity associated with a production environment is based upon the “availability” of the application. Development and Test productivity is based upon the ability to view chagnes to the application in a representative (pre-production) environment. In other words the availability of the _changer_ in pre-production is more valuable to Dev productivity than any specific pre-production instance of the application environment. Those application environment instances are, in fact, disposable by definition.

Disposability of a running application environment is a bit jarring to Ops folks when they see a group of users (developers and testers in this case) needing the system. Everything in Ops tools and doctrine is oriented toward making sure that an application environment gets set up and STAYS that way. That focus on keeping things static is exactly the point to which DevOps is a reaction.  Knowing that does not make it easy to make the mental shift, of course.  Once made, however, it is precisely why tools that facilitate rapidly provisioning environments are frequently the earliest arrivals when most organizations seek to adopt DevOps.

About Those “QA” Environments…

DevOps is about getting developed software to users faster and getting feedback from that software back to developers faster. The notion of a clean cycle with very low latency is a compelling vision. Most IT shops are struggling with how to get there and many must maintain a division between the full production environment and the test environments that lead to production. Some of the reasons are historical, some are managerial, and some are truly related to the business environment in which that organization operates.

Fortunately or unfortunately, most shops have plenty to do before they need to worry about tying production in more directly. The state of QA environments is usually relatively weakly managed. Part of that is historic – the environments are not viewed as very important and it is never quite clear who owns keeping them properly current to the production environment. Part of this is the traditional focus of narrowly testing features of the application to minimize test cycles. And part, too, is a historical view that lab configurations did not matter as long as they were ‘close enough’ for testing features.

Of course recent lessons are teaching us that the historical approach is not necessarily conducive to rapid iteration of software. We are also learning that theoretically small changes to the production environments can potentially invalidate theoretically tested deliverables. Combined with the advent of very good rapid provisioning systems, automated configuration management tools, and highly virtualized infrastructures there are few reasons to not have a first-rate QA environment.

But is the paradigm of QA “environments” really the right paradigm for how we approach rapidly releasing features into the wild? As teams try to lessen the notion of a big, standing lab environment for testing software, the approach looks somewhat less like traditional testing and more like qualifying a new feature for use in the system. This is a subtle difference. “Quality Assured” and “Qualified for Use” are two different notions. One says you delivered what you set out to deliver. The other says you know it works in some situation. Some would say that “Quality” implies the latter, but I would answer that if you have to parse a definition to get a meaning, you probably are using the wrong word.

But words only matter to a point, it is the pardigm they represent that is ultimately interesting and impactful. There are extreme examples in the “real” world. For example, just because a part is delivered with quality to its design goals does NOT mean that it is certified for use in a plane. As someone who flies often, I view this as good.

So the question I would ask is whether you simply test to see if it meets design goals in some hopefully representative lab somewhere or do you use DevOps techniques to truly qualify releases for use in the real production environment.

Change Mis-management (Part 3)

For part three of the Change Mis-management series, I want to pick on the tradition of NOT keeping system management scripts in version control.  This is a fascinating illustration of the cultural difference between Development and Operations.  Operations is obsessed with ensuring stability and yet tolerates fairly loose control over things that can decimate the environment at the full speed of whatever machine happens to be running the script.  Development is obsessed with making incremental changes to deliver value and would never tolerate such loose control over their code.  I have long speculated that this level of discipline for Development is in fact a product of the fact that they have to deal with and track a LOT of change.

Whatever the cause and whether or not you believe in Agile and/or the DevOps movement, this is really a fundamental misbehavior and we all know it.  There really is no excuse for not doing it.  Most shops have scripts that control substantial swaths of the infrastructure.  There are various application systems that depend on the scripts to ensure that they can run in a predictable way.  For all intents and purposes these scripts represent production-grade code.

This is hopefully not a complex problem to explain or solve.  The really sad part is that every software delivery shop of any size already has every tool needed to version manage all of their operations scripts.  There is no reason that there can’t be an Ops Scripts tree in your source control system.  Further, those repositories are often set up with rules that force some sort of notation for the changes that are being put into those scripts and will track who checked it in, so you have better auditing right out of the gate.

Further, you now have a way to, if not know, then at least have a good idea, what has been run on the systems.  That is particularly important if the person who ran the script is not available for some reason.  If your operations team can agree on the doctrine always running the ‘blessed’ version and never hack it on the filesystem, then life will get substantially better for everyone.  Of course, the script could be changed after checkout and the changes not logged.  Any process can be circumvented – most rather easily when you have root.  The point is to make such an event more of an anomaly.  Maybe even something noticeable – though I will talk about that in the next part of this series.

This is really just a common-sense thing that improves your overall organizational resilience.  Repeat after me:

  • I resolve to always check in my script changes.
  • I resolve to never run a script unless I have first checked it out from source to make sure I have the current version.
  • I resolve to never hack a script on the filesystem before I run it against a system someone other than me depends on.  (Testing is allowed before check-in; just like for developers)
  • I resolve to only run scripts of approved versions that I have pulled out of source control and left unmodified.

It is good, it is easy, it does not take significant time to do and saves countless time-consuming screw-ups.  Just do it.

Change Mis-management (Part 2)

In my last post, I mentioned three things that need to be reliably happening in order to achieve a faster, more predictable release process.  The first one was to unify change management for the system between the Ops and Development sides.  On the surface,  this sounds like a straightforward thing.  After all a tool rationalization exercise is almost routine in most shops.  It happens regularly due to budget reviews, company mergers or acquisitions, etc.

Of course, as we all know, it is never quite that easy even when the unification is happening for nominally identical teams.  For example, your company buys another and you have to merge the accounting system.  Pretty straightforward – money is money, accountants are accountants, right?  Those always go perfectly smoothly, right?  Right?

In fact, unifying this aspect of change management can be especially thorny because of the intrinsic differences in tradition between the organizations.  Even though complex modern systems evolve together as a whole, few sysadmins would see themselves as ‘developing’ the infrastructure, for example.  Additionally, there are other problems.  For instance, operations are frequently seen as service providers who need to provide SLAs for change requests that are submitted through the change management system.  And a lot of operational tracking in the ticketing system is just that – operational – and it does not relate to actual configuration changes or updates to the system itself.

The key to dealing with this is the word “change”.  Simplified, system changes should be treated in the same way as code changes are handled.  Whatever that might be.  For example, it could be a user story in the backlog.  The “user” might be a middleware patch that a new feature depends on and the work items might be around submitting tickets to progressively roll that up the environment chain into production.  The goal is to track needed changes to the system as first-class items in the development effort.  The non-change operational stuff will almost for sure stay in the ticketing system.  A simple example, but applying the principle will mean that the operating environment of a system evolves right along with its code – not as a retrofit or afterthought when it is time to deploy to a particular environment or there is a problem.

The tool part is conceptually easy – someone manages the changes in the same system that backlog/stories/work items are handled.  However, there  is also the matter of the “someone” mentioned in that sentence.  An emerging pattern I have seen in several shops is to cohabitate an Ops-type with the development team.  Whether these people are the ‘ops representative’ or ‘infrastructure developers’ their role is to focus on evolving the environment along with the code and ensuring that the path to production is in sync with how things are being developed.  This is usually a relatively senior person who can advise on things, know when to say no, and know when to push.  The real shift is that changes to the operating environment of an application become first-class citizens at the table when code or test changes are being discussed and they can now be tracked as part of the work that is required to deliver on an iteration.

These roles have started popping up in various places with interesting frequency.  To me, this is the next logical step in Agile evolution.  Having QA folks in the standups is accepted practice nowadays and organizations are figuring out that the Ops guys should be at the table as well.  This does a great job of pro-actively addressing a lot of the release / promotion headaches that slow things up as things move toward production.  Done right, this takes a lot stress and time out of the overall Agile release cycle.