Rollback Addiction

A lot of teams are fascinated with the notion of ‘rollback’ in their environments.  They seem to be addicted to its seductive conceptual simplicity.  It does, of course, have valid uses, but, like a lot of things, it can become a self-destructive dependency if it is abused.  So, let’s take a look at some of the addictive properties of rollback and how to tell if you might have a problem.

Addictive property #1 – everybody is doing it.  One of the first things we learn in technology (Dev or Ops) is to keep a backup copy of a file when we change something.  That way, we know what we did in case it causes a problem.  In other words, our first one is free and was given to us by our earliest technology mentors.  As a result, everyone knows what it is.  It is familiar and socially acceptable.  We learn it while we are so professionally young that it becomes a “comfort food”.  The problem is that it is so pervasive that people do it without noticing.  They will revert things without giving it a second thought because it is a “good thing”.  And this behavior is a common reason rollback does not scale.  In a large system, where many people might be making changes, others might make changes that build on yours.  Undoing yours without understanding those dependencies means that you are breaking other things in an attempt to fix one thing.  If you are in a large environment and changes “backward” are not handled with the exact same rigor as changes forward, you might have a problem.

Addictive property #2 – it makes you feel good and safe.  The idea that “you are only as good as your last backup” is pretty pervasive.  So, the ability to roll something back to a ‘known good’ state gives you that warm, fuzzy feeling that it’s all OK.  Unfortunately, in large-scale situations with any significant architectural complexity, it is probably not OK.  Some dependency is almost certainly unknown, overlooked, or assumed to be handled manually.  That will lead to all sorts of “not OK” when you try to roll back.  If rollback is the default contingency plan for every change you make and you don’t systematically look at other options to make sure it is the right answer, you might have a problem.

Addictive property #3 – it is easy to sell.  Management does not understand the complexity of what is required to implement a set of changes, but they do understand “Undo”.  As a result, it is trivial to convince them that everything is handled and, if there should be a problem with a change, you can just ‘back it out’.  Being able to simplify the risks to an ‘undo’ type of concept can eliminate an important checkpoint from the process.  Management falls into the all too human behavior of assuming there is an ‘undo’ for everything and stops questioning the risk management plan because they think it is structurally covered.  This leads to all sorts of ugliness should there be a problem and the expectation of an easy back-out is not met.  Does your team deliberately check its contingency plan for oversights every time, or does it assume that it will just ‘roll it back’?  If the latter, you might have a problem.

As usual, the fix for a lot of this is self-discipline.  The discipline to do things the hard and thorough way that takes just that little bit longer.  The discipline to institutionalize and reward healthy behaviors in your shop.  And, as usual, that goes against a fair bit of human nature that can be very difficult to overcome, indeed.

DevOps is NOT a Job Title

Given my recent posts about organizational structure, I feel like I need to clarify my stance on this…

You know a topic is hot when recruiters start putting it in job titles.  I do believe that most organizations will end up with a team of “T-shaped people” focused on using DevOps techniques to ensure that systems can support an Agile business and its development processes.  However, I am not a fan of hanging DevOps on the title of everyone involved.

Here’s the thing: if you have to put it in the name to convince yourself or other people you are doing it, you probably are not.  And the very people you hope to attract may well avoid your organization because it fails the ‘reality’ test.  In other words, you end up looking like you don’t get it.  A couple of analogies come to mind immediately.

  • First, let’s look at a country that calls itself the “People’s Democratic Republic of” somewhere.  That is usually an indicator that it is not any of those modifiers and the only true statement is the ‘somewhere’ part.  Similarly, putting “DevOps Sysadmin” on top of a job description that, just last week, said “Sysadmin” really isn’t fooling anyone.
  • Second, hanging buzzwords on job titles is like a 16-year-old painting racing stripes on the four-door beater they got as their first car.  With latex house paint.  You may admire their enthusiasm and optimism.  You certainly wish them the best.  But you have a pretty realistic assessment of the car.

Instead, DevOps belongs down in the job description.  DevOps in a job role is a mindset and an approach used to define how established skills are applied.  You are looking for a Release Manager to apply DevOps methods in support of your web applications.  Put it down in the requirements bullet points just as you would put things like ‘familiar with scripting languages’, ‘used to operating in an [Agile/Lean/Scrum] environment’, or ‘experience supporting a SaaS infrastructure’.

I realize that I am tilting at windmills here.  We went through a spate of “Agile” Development Managers and the number of  “Cloud” Sysadmins is just now tapering.  So, I guess it is DevOps’ turn.  To be sure, it is gratifying and validating to see such proof that DevOps is becoming a mainstream topic.  I should probably adopt a stance of ‘whatever spreads the gospel to the masses’.  But I really just had to get this rant off my chest after seeing a couple of serious “facepalm” job ads.

How a DOMO Fits

In my previous post, I discussed the notion of a DevOps Management Organization or DOMO.  As I said there, this is an idea that is showing up under different names at shops of varying sizes.  I thought I would share a drawing of one to serve as an example.  The basic structure is, of course, a matrix organization with the ability to have each key role present within the project.  It also provides for shared infrastructure services such as support and data.  You could fairly easily replace the Business Analyst (BA) role with a Product Owner / Product Manager role, change “Project” to “Product”, and have a variant of this structure that I have seen implemented at a couple of SaaS providers around Austin.

This structure does assume a level of maturity in the organization as well as the underlying infrastructure.  It is useful to note that the platform is designated as a “DevOps Platform”.  It would probably be better to phrase that as a cloud-type platform – public, private, or hybrid – where the permanence of a particular image is low, but the consistency and automation are high.  To be sure, not all environments have built such an infrastructure, but many, if not most, are building them aggressively.  The best time to look at the organization is while those infrastructures are being built and companies are looking for the best ways to exploit them.

[Diagram: Organization with DOMO]

Rise of the DOMO

A lot of the DevOps conversations I have had lately have been around organizational issues and what to do about the artificial barriers and silos that exist in most shops.  Interestingly, there is a pattern emerging among those discussing or even implementing changes to deal with these problems.  The pattern involves matrix organizations and the rise of what I call a “DevOps Management Organization” (DOMO).  The actual names vary, but the role of the organization is consistent.

Most software delivery organizations end up with some kind of matrix where product managers, project managers, engineers, architects, and QA are all tied to the success of a given project / product while maintaining a discipline-specific tie.  In the case of an ISV, you can add some other disciplines around fulfillment, support, etc.  The variable is whether the project / product is their direct reporting organization or they report to a discipline.  And the answer can depend on the role.  For example, a Project Manager might be part of the engineering organization and be a participant in the company’s PMO.  Or they might be part of the PMO and simply assigned to (and funded by) the project / product they are working on.

When you add Agile to the mix, the project dimension tends to take primacy over the disciplines, since the focus narrows to getting working software produced on a much tighter iteration timeline.  This behavior leads to the DevOps discussion as the project/product team discovers that there is no direct alignment of Operations to the project efforts.  Additionally, Operations is a varied, multi-disciplinary space in and of itself.  Thus it becomes extremely difficult for the project/product team to drive focused activity through Operations to deliver a particular iteration/release.  The classic DevOps problem.

The recent solution trend to this problem has been the creation of what I call “Ops Sherpa” roles in the project/product teams.  This role is a cross-disciplinary Ops generalist who is charged with understanding the state of the organization’s operations environment and making sure that the development effort is aligned with operational realities.  That includes full lifecycle responsibility – from ensuring that Dev and QA environments are relevant equivalent configurations to production in order to make sure deliverables are properly qualified, to making sure that various Operations disciplines are aware of (and understand) any changes that will be required to support a particular release deliverable.  In more mature shops, this may grow out of an existing Release Management role or, in a particularly large organization, justify a dedicated team.

Critically, though, this role gets matrixed back into the Operations organization at a high enough level to sponsor action across operational silos.  This point is what I call the Head of the DOMO.  It provides the leverage to deal with tactical problems as well as the strategic guidance to drive cross-project continuous improvement into the operations platform space in support of faster execution (aka DevOps-speed releases).

Whatever the name, the fact that large companies are recognizing the value of deliberately investing in this space is a validation that being good at release execution is strategic to cost-effectively shortening release cycles.

Classic Metrics for How Good You Are

One of the ‘best metrics ever’ is the classic “bus number”.  This measures how many people in an organization can be hit by a bus before that organization’s operations or progress is severely hindered by their absence.  It is a slightly funny way of measuring the resilience of an organization against the anti-pattern of hoarding knowledge in an individual’s brain.  The idea is that a resilient organization should have a very high bus number and not be vulnerable to the loss of ‘critical staff’.
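If you want a quick, rough read on where you stand, your version control history is one place to start.  The following is a minimal sketch, assuming commit authorship is a usable proxy for where knowledge lives (it is only a proxy – undocumented tribal knowledge will not show up at all); the 18-month window is an arbitrary assumption.

```python
# bus_check.py - rough "bus number" proxy: which files have only one recent author?
# Assumes you run it from inside a Git working copy; authorship is only a proxy
# for knowledge, so treat the output as a conversation starter, not a verdict.
import subprocess
from collections import defaultdict

def authors_per_path(repo=".", since="18 months ago"):
    """Map each file touched recently to the set of distinct commit authors."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:A:%ae"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    owners = defaultdict(set)
    author = None
    for line in log.splitlines():
        if line.startswith("A:"):
            author = line[2:]              # new commit; remember its author
        elif line.strip():
            owners[line.strip()].add(author)
    return owners

if __name__ == "__main__":
    owners = authors_per_path()
    single = sorted(p for p, a in owners.items() if len(a) == 1)
    print(f"{len(single)} of {len(owners)} recently-touched files have a single author")
    for path in single[:20]:
        print("  single-owner:", path)
```

Files that only one person ever touches are exactly the ones to ask the “who would I go to?” question about.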

Think about it next time you are looking at any part of your system.  Ask yourself who you would ask about that particular module, image, or whatever.  Then ask yourself who you would go to if the first person was unavailable.  How confident are you that you would quickly / expediently get your answer?  How confident are you that you could just look the information up in a Wiki or other documentation?

If you look at the questions above and start thinking that ‘we would figure it out after a while’ or making other excuses, you have, at a minimum, a problem with communication and collaboration.  You almost certainly have a process problem.  And you may well have a cultural problem.  Make no mistake, what it means is that your team/organization/project is playing in traffic and simply waiting for the inevitable to happen.

And when something does happen – say one of the project’s “hero coders” takes a new job – it will be miserable for all who remain as they try to figure out what the hero was doing.  Meanwhile the project’s progress languishes and the deadline becomes unachievable.  Morale goes down as frustration goes up.  Maybe someone else decides to leave out of a sense of futility, making the problem worse.  And it will have been completely avoidable.  It will be completely the fault of the leadership that was either not assertive enough to make the hero share their knowledge or not disciplined enough to include sustainability in their coaching, plans, and day-to-day execution priorities.

This is serious stuff and is worth the investment of time to solve.  The habit of focusing on overall sustainability is something that successful, resilient organizations emphasize.  It is well documented in the classic book “Good to Great” by Jim Collins, which describes such organizations as sophisticated machines using the analogy of clock building.  The notion is simple, really: the goal is to build a lasting thing that continues on as people come and go.  The project / organization must be bigger than any individual, the individuals involved must understand that, and management must encourage or enforce that mindset.  In the book, the organizations that did this radically outperformed their peers in the same markets in the same timeframe.

The reality is that you will probably always have some stuff (ideally only non-critical or very new stuff) that is not well disseminated.  See those items for what they are and triage / prioritize them so that the knowledge gap does not accumulate as technical debt.  Or, if you do let it accumulate, do so consciously, visibly, and at a level at which you know you can tolerate the risk.

Managing To Get To Agile is Harder

There are a variety of jokes or snarky comments made about management.  A lot of them are modern echoes of Industrial era factory practices.  And that is the hard thing – most large corporate management doctrines are merely evolutions of principles established in the industrial era.  That era was characterized by hyper-specialization of jobs so that the companies could achieve economies of scale, from which competitive advantage could be derived.  That, of course, led to a whole system of policies, rules, and even laws that reflected that era.

Of course, competitive advantage from economies of scale is not the focus today.  Business moves too fast.  The key advantages are responsiveness to a changing market and the ability to exploit shorter-lived opportunities.  Even manufacturing processes have evolved to enable faster retooling to serve different markets with the same facility and equipment.  Agile development and DevOps are how the need for business agility gets reflected in IT.

Despite these market dynamics, the people management doctrine for most businesses still looks like an old-school industrial approach.  There is still a push toward specialized roles and toward doing appraisals within a narrow set of rules for each specialty.  The generalization that is intrinsic to Agile execution is not valued, and corporate structures often limit what lower- and middle-level managers can do in terms of incenting the behaviors of folks on their teams.  A lot of that is based on the fact that HR structures are designed to stay within a narrow band of safe and easily defended legal structures so that a pissed-off employee can’t really sue if there is a problem with perceived fairness.

These things are easier to achieve in smaller companies where there is not the same legacy and, frankly, there just isn’t as much sue-able money.  It is naive to not think about these aspects and would be patently unfair to simply slam things for being the way they are.  Things are the way they are for a lot of very good and very complex reasons; some of which are beyond the direct control of the business.

It is solvable, of course.  You can use creative organizational structures that put nominal specialists ‘on assignment’ in other specialty teams.  A popular extension of this is full-on matrix management.  A variation on the matrix is to have people farmed out to project teams in a similar manner to how consulting companies do things.

The common point of these solutions is that they require managers to work together in new ways.  They require managers of managers to encourage good team dynamics among their teams of managers.  They require a lot of communication and interaction among managers and with scattered teams.  Lower-level managers will have to be empowered to invest in and coach their teams into adaptable groups with good team dynamics.  It means the managers will have to be a lot more “hands-on” – more leader-like than manager-like – than they might be used to.

That is a lot harder for all levels of management than scientifically managing a group of theoretically interchangeable specialists.  The odds are that the managers have been trained in management skills rather than leadership skills.  So, as the organization goes Agile, make sure that the investment includes an investment in how to actually manage in an Agile environment.  It really is different.  It really takes an investment.  And it really will eventually take structural change.

Change Mis-management (Part 3)

For part three of the Change Mis-management series, I want to pick on the tradition of NOT keeping system management scripts in version control.  This is a fascinating illustration of the cultural difference between Development and Operations.  Operations is obsessed with ensuring stability and yet tolerates fairly loose control over things that can decimate the environment at the full speed of whatever machine happens to be running the script.  Development is obsessed with making incremental changes to deliver value and would never tolerate such loose control over its code.  I have long speculated that Development’s level of discipline here is a product of having to deal with and track a LOT of change.

Whatever the cause, and whether or not you believe in Agile and/or the DevOps movement, this is really a fundamental misbehavior and we all know it.  There really is no excuse for not versioning these scripts.  Most shops have scripts that control substantial swaths of the infrastructure.  There are various application systems that depend on the scripts to ensure that they can run in a predictable way.  For all intents and purposes, these scripts represent production-grade code.

This is hopefully not a complex problem to explain or solve.  The really sad part is that every software delivery shop of any size already has every tool needed to version manage all of its operations scripts.  There is no reason that there can’t be an Ops Scripts tree in your source control system.  Further, those repositories are often set up with rules that force some sort of notation for the changes being put into those scripts and will track who checked them in, so you have better auditing right out of the gate.
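As a minimal sketch of the “force some sort of notation” idea, assuming a Git repository and a made-up CHG-1234 ticket format, a commit-msg hook in the ops-scripts repository can refuse commits that don’t reference a change ticket:

```python
#!/usr/bin/env python3
# .git/hooks/commit-msg - reject commits whose message has no change-ticket
# reference. The "CHG-1234" pattern is purely illustrative; use whatever your
# ticketing system actually issues.
import re
import sys

msg_path = sys.argv[1]                  # Git passes the path to the commit message
with open(msg_path) as f:
    message = f.read()

if not re.search(r"\bCHG-\d+\b", message):
    sys.stderr.write("Commit rejected: please reference a change ticket (e.g. CHG-1234).\n")
    sys.exit(1)                         # non-zero exit aborts the commit
```

Most hosted repository tools have an equivalent server-side option, which is harder to bypass than a local hook.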

Further, you now have a way to know, or at least have a good idea, what has been run on the systems.  That is particularly important if the person who ran the script is not available for some reason.  If your operations team can agree on the doctrine of always running the ‘blessed’ version and never hacking it on the filesystem, then life will get substantially better for everyone.  Of course, the script could be changed after checkout and the changes not logged.  Any process can be circumvented – most rather easily when you have root.  The point is to make such an event more of an anomaly.  Maybe even something noticeable – though I will talk about that in the next part of this series.
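One low-friction way to encourage the ‘blessed version only’ doctrine is a small run wrapper.  The sketch below is a minimal example, assuming the ops scripts are shell scripts living in a Git working copy; the layout and names are illustrative, not a prescription.

```python
#!/usr/bin/env python3
# run_blessed.py - refuse to run an ops script that is untracked or differs
# from the committed version. Run it from inside the ops-scripts working copy.
import subprocess
import sys

def is_blessed(path):
    """True if the file is tracked by Git and identical to the committed copy."""
    tracked = subprocess.run(
        ["git", "ls-files", "--error-unmatch", path],
        capture_output=True).returncode == 0
    unchanged = subprocess.run(
        ["git", "diff", "--quiet", "HEAD", "--", path]).returncode == 0
    return tracked and unchanged

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: run_blessed.py <script> [args...]")
    script, *args = sys.argv[1:]
    if not is_blessed(script):
        sys.exit(f"{script} is untracked or locally modified - check it in first")
    subprocess.run(["bash", script, *args], check=True)   # assumes shell scripts
```

It does not stop someone with root from editing a file after the check, but it makes the disciplined path the easy path.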

This is really just a common-sense thing that improves your overall organizational resilience.  Repeat after me:

  • I resolve to always check in my script changes.
  • I resolve to never run a script unless I have first checked it out from source to make sure I have the current version.
  • I resolve to never hack a script on the filesystem before I run it against a system someone other than me depends on.  (Testing is allowed before check-in; just like for developers)
  • I resolve to only run scripts of approved versions that I have pulled out of source control and left unmodified.

It is good, it is easy, it does not take significant time to do and saves countless time-consuming screw-ups.  Just do it.

Change Mis-management (Part 2)

In my last post, I mentioned three things that need to be happening reliably in order to achieve a faster, more predictable release process.  The first one was to unify change management for the system between the Ops and Development sides.  On the surface, this sounds like a straightforward thing.  After all, a tool rationalization exercise is almost routine in most shops.  It happens regularly due to budget reviews, company mergers or acquisitions, etc.

Of course, as we all know, it is never quite that easy even when the unification is happening for nominally identical teams.  For example, your company buys another and you have to merge the accounting system.  Pretty straightforward – money is money, accountants are accountants, right?  Those always go perfectly smoothly, right?  Right?

In fact, unifying this aspect of change management can be especially thorny because of the intrinsic differences in tradition between the organizations.  Even though complex modern systems evolve together as a whole, few sysadmins would see themselves as ‘developing’ the infrastructure, for example.  Additionally, there are other problems.  For instance, Operations is frequently seen as a service provider that needs to offer SLAs for change requests submitted through the change management system.  And a lot of operational tracking in the ticketing system is just that – operational – and does not relate to actual configuration changes or updates to the system itself.

The key to dealing with this is the word “change”.  Simplified, system changes should be treated in the same way as code changes are handled.  Whatever that might be.  For example, it could be a user story in the backlog.  The “user” might be a middleware patch that a new feature depends on, and the work items might be around submitting tickets to progressively roll that patch up the environment chain into production.  The goal is to track needed changes to the system as first-class items in the development effort.  The non-change operational stuff will almost certainly stay in the ticketing system.  A simple example, but applying the principle will mean that the operating environment of a system evolves right along with its code – not as a retrofit or afterthought when it is time to deploy to a particular environment or there is a problem.
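To make that concrete, here is a purely illustrative sketch of what treating an infrastructure change as a first-class backlog item might look like; the field names, environments, and ticket references are made up, and in practice this would live in whatever tool already holds your stories.

```python
# Illustrative only: an infrastructure change tracked as a first-class backlog
# item, with per-environment rollout tasks that map to Ops tickets. The shape
# is an assumption for the example, not any real tool's schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RolloutTask:
    environment: str                  # e.g. "dev", "qa", "prod"
    ops_ticket: Optional[str] = None  # filled in once the ticket is actually filed
    done: bool = False

@dataclass
class InfraChangeStory:
    title: str
    depends_on_feature: str           # the code story that needs this change
    rollout: List[RolloutTask] = field(default_factory=list)

story = InfraChangeStory(
    title="Apply middleware patch ahead of the new reporting feature",
    depends_on_feature="STORY-1234",
    rollout=[RolloutTask("dev"), RolloutTask("qa"), RolloutTask("prod")],
)
```

The point is not the data structure; it is that the rollout work is visible in the same backlog as the feature that depends on it.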

The tool part is conceptually easy – someone manages the changes in the same system in which the backlog/stories/work items are handled.  However, there is also the matter of the “someone” mentioned in that sentence.  An emerging pattern I have seen in several shops is to embed an Ops-type with the development team.  Whether these people are called ‘ops representatives’ or ‘infrastructure developers’, their role is to focus on evolving the environment along with the code and ensuring that the path to production is in sync with how things are being developed.  This is usually a relatively senior person who can advise on things, know when to say no, and know when to push.  The real shift is that changes to the operating environment of an application become first-class citizens at the table when code or test changes are being discussed, and they can now be tracked as part of the work that is required to deliver on an iteration.

These roles have started popping up in various places with interesting frequency.  To me, this is the next logical step in Agile evolution.  Having QA folks in the standups is accepted practice nowadays, and organizations are figuring out that the Ops guys should be at the table as well.  This does a great job of proactively addressing a lot of the release / promotion headaches that slow things down as they move toward production.  Done right, this takes a lot of stress and time out of the overall Agile release cycle.

New Toy!!! IBM Workload Deployer

The company I work for serves many large corporations, many of whom are IBM shops with commensurately large WebSphere installed bases.  So, as you might imagine, it behooves us to keep abreast of the latest stuff IBM delivers.

We are fortunate enough to be pretty good at what we do and to sit in the premier tier of IBM’s partner hierarchy, and we were recently able to get an IBM Workload Deployer (IWD) appliance in as an evaluation unit.  If you are not familiar, the IWD is really the third revision of the appliance formerly known as the IBM WebSphere CloudBurst appliance.  I do not know, but I would presume the rebrand is related to the fact that the IWD is handling more generic workloads than simply those related to WebSphere and therefore deserved a more general name.

You can read the full marketing rundown on the IBM website here:  IBM Workload Deployer

This is a “cloud management in a box” solution that you drop onto your network, point at one or more of the supported hypervisors, and it handles images, load allocation, provisioning, etc.  You can give it simple images to manage, but the thing really lights up when you give it “Patterns” – a term which translates to a full application infrastructure (balancing webservers, middleware, DB, etc.).  If you use this setup, the IWD will let you manage your application as a single entity and maintain the connections for you.

I am not an expert on the thing – at least not yet – but a couple of other points that immediately jump out at me are:

  • The thing also has a pretty rich Python-based command line client that should allow us to do some smart script stuff and maintain those in a proper source repository.
  • The patterns and resources also have intelligence built in so that you can’t break the dependencies of a published configuration.
  • There are a number of pre-cooked template images that don’t seem very locked down, which you can use as starting points for customization, or you can roll your own.
  • The Rational Automation Framework tool is here, too, so that brings up some migration possibilities for folks looking to bring apps either into a ‘cloud’ or into a better-managed virtual situation.

I do get to be one of the first folks to play with the thing, so I’ll be drilling into as many of these and other things as time permits.  More on it as it becomes available.

Change Mis-management (Part 1)

One of the pillars of DevOps thinking is that the system is a whole.  No part can function without the others, so they all should be treated as equals.  Of course, things rarely work that way.  One of the glaring examples in a lot of shops is the disparity in the way changes are managed / tracked between Dev and Ops.  There are multiple misbehaviors we can examine in just this one area.  Some other day we can discuss how there are different tracking systems for different parts of the system and how many shops have wholly untracked configurations for some components.  Today, instead, we’re going to talk about the different levels of diligence that get applied in Ops versus Dev when dealing with change.

Think about this for a second.  No developer in an enterprise shop would ever think of NOT checking in all of their code changes.  And if they did, they would view it as a pretty serious bypass of good practice.  Code goes into the repository and is pulled from that repository for build, test, and deployment.  It is a bedrock practice of commercial software development.  Meanwhile, the Ops team has a bunch of scripts that they use to maintain the environment.  How many of those are religiously checked into a version control system (VCS)?  And of those that do end up in a VCS, how many have change tickets attached when they are modified?  And then there are the VM template images, router configs, etc. that may or may not be safely stored someplace.

All too often the change management that happens here is a script update or a command executed someplace on some piece of infrastructure.  The versioning takes the form of a file extension; you know – “.old”, “.bak”, “.orig”, “.[todaysdate]” – so that there is some… evidence… that a change was made to the system.  The tracking of the change is often a manually updated trouble ticket or change request.  And let’s not forget that the Ops ticketing system probably does not talk to the change management system the developers use.  Is it any wonder that things get screwed up when something comes down the pipe from Dev?

To really have things working properly, you have to:

  • Unify the change management between Ops and Dev
  • Track scripts the way you would any source code on which your applications’ function depends.
  • Have a method to automatically capture changes made to the environment and log them.

All three of these things are necessary if you really want to achieve a higher-speed and more predictable release process.
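On the third point, here is a minimal sketch of what automatic capture could look like, assuming a handful of configuration directories worth watching; the paths, the state file location, and the idea of just printing the log lines are all placeholders for whatever your environment actually uses.

```python
# drift_watch.py - snapshot watched config paths and log anything that changed
# since the last run. Run it from cron or a scheduler; in a real setup the log
# lines would feed the unified change management system, not stdout.
import hashlib
import json
import pathlib
import time

WATCHED = ["/etc/nginx", "/etc/cron.d"]            # illustrative paths only
STATE = pathlib.Path("/var/tmp/drift_state.json")  # illustrative state location

def snapshot():
    """Hash every file under the watched paths."""
    digests = {}
    for root in WATCHED:
        for f in pathlib.Path(root).rglob("*"):
            if f.is_file():
                digests[str(f)] = hashlib.sha256(f.read_bytes()).hexdigest()
    return digests

def diff(old, new):
    added   = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in new if p in old and old[p] != new[p])
    return added, removed, changed

if __name__ == "__main__":
    previous = json.loads(STATE.read_text()) if STATE.exists() else {}
    current = snapshot()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for label, paths in zip(("added", "removed", "changed"), diff(previous, current)):
        for path in paths:
            print(f"{stamp} {label}: {path}")
    STATE.write_text(json.dumps(current))
```

It will not tell you who made a change or why – that is exactly why the first two points matter – but it makes unlogged changes visible instead of invisible.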