How Fast Should You Change the Tires?

I am an unabashed car nut and like to watch a variety of motor racing series. In particular, I tend to stay focused on Formula 1 with a secondary interest in the endurance series (e.g., Le Mans). In watching several races recently, I observed that the differences in how each series managed tire changes during pit stops carried some interesting analogies to deploying software quickly.

Each racing series has a different set of rules and limitations with regard to how pit stops may be conducted. These rules are imposed for a combination of safety reasons, competitive factors, and the overall viability of the racing series. There are even rules about changing tires. Some series enable very quick tire changes – others less so. The reasons behind these differences and how they are applied by race teams in tight, time-competitive situations can teach us lessons about the haste we should or should NOT have when deploying software.

Why tire changes? The main reason is that, like deploying software, there are multiple potential points of change (4 tires on the car; software, data, systems, and network in a deployment). And, in both situations, it is less important how fast you can change just one of them than how fast you change all of them. There are even variants where you may not need to change all 4 tires (or system components) every time, but you must be precise in your changes.

Formula 1

Formula 1 is a fantastically expensive racing series and features extreme everything – including the fastest pit stops in the business. Sub-4-second stops are the norm, during which all 4 tires are changed. There are usually around 18 people working on the car – 12 of whom are involved in getting the old tires off and clear while putting new tires on (not counting another 2 to work the jacks). That is a large team, with a lot of expensive people on it, who invest a LOT of expensive time practicing to ensure that they can get all 4 tires changed in a ridiculously short period of time. And they have to do it for two cars with potentially different tire use strategies, do it safely, while competing in a sport that measures advantage in thousandths of a second.

But, there is a reason for this extreme focus / investment in tire changes. The tire changes are the most complex piece of work being done on the car during a standard pit stop. Unlike other racing series, there is no refueling in Formula 1 – the cars must have the range to go the full race distance. In fact, the races are distance- and time-limited, so the components on the cars are simply engineered to go that distance without requiring service, and therefore time, during the race. There are not even windows to wash – it is an open-cockpit car. So, the tires are THE critical labor happening during the pit stop and the teams invest accordingly.

Endurance (Le Mans)

In contrast to the hectic pace of a Formula 1 tire change is Endurance racing. These are cars that are built to take the abuse of racing for 24 hours straight. These cars require a lot of service over the course of that sort of race, and the tires are therefore only one of several critical points that have to be serviced in the course of a race. Endurance racers have to be fueled, have brake components replaced, and the three drivers have to switch out periodically so they can rest. The rules of this series, in fact, limit the number of tire wrenches the team can use in the pits to just one. That is done to discourage teams from cutting corners and also to keep team size (and therefore costs) down.

NASCAR

NASCAR is somewhere between Formula 1 and Endurance racing when it comes to tire changes. This series limits tire wrenches to two and tightly regulates the number of people working on the car during a pit stop. These cars require fuel, clean-up, and tires just like the Endurance cars, but generally do not require any additional maintenance during a race, barring damage. So, while changing tires quickly is important, there are other time-eating activities going on as well.

Interestingly, in addition to safety considerations, NASCAR limits personnel to keep costs down so that the teams competing in the series can afford to do so. That keeps the overall series competition healthy by ensuring a good number of participants and the ability of new teams to enter, which, by contrast, is one of the problems Formula 1 has had over the years.

In comparing the three approaches to the same activity, you see an emerging pattern where the ultimate speed of changing tires gets traded off based on cost and contextual criticality. These are the same trade-offs that are made in a business when it looks at how much faster it can perform a regular process such as deploying software. You could decide you want sub-four-second tire changes, but that would be dumb if your business needs 10 seconds for refueling or several minutes for driver swaps and brake overhauls. And if it does, your four-second tire change would look wasteful at best as your army of tire guys stands around and watches the guy fueling the car or the new driver adjusting his safety harnesses.

The message here is simple – understand what your business needs when it comes to deployment. Take the thrill of speed out of it and make an unemotional decision to optimize, knowing that optimal is contextually fastest without waste. Organizations that literally make their living from speed understand this. You should consider this the next time you go looking to do something faster.

Rollback Addiction

A lot of teams are fascinated with the notion of ‘rollback’ in their environments.  They seem to be addicted to its seductive conceptual simplicity.  It does, of course, have valid uses, but, like a lot of things, it can become a self-destructive dependency if it is abused.  So, let’s take a look at some of the addictive properties of rollback and how to tell if you might have a problem.

Addictive property #1 – everybody is doing it.  One of the first things we learn in technology (Dev or Ops) is to keep a backup of a file when we change something.  That way, we know what we did in case it causes a problem.  In other words, our first one is free and was given to us by our earliest technology mentors.  As a result, everyone knows what it is.  It is familiar and socially acceptable.  We learn it while we are so professionally young that it becomes a “comfort food”.  The problem is that it is so pervasive that people do it without noticing.  They will revert things without giving it a second thought because it is a “good thing”.  And this behavior is a common reason rollback does not scale.  In a large system, where many people might be making changes, others might make changes based on your changes.  So, undoing yours without understanding others’ dependencies means that you are breaking other things in an attempt to fix one thing.  If you are in a large environment and changes “backward” are not handled with the exact same rigor as changes forward, you might have a problem.
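
To make the dependency point concrete, here is a minimal sketch of the kind of check a disciplined “backward” change would go through before anyone reverts anything.  The change log, names, and functions are hypothetical, not any particular tool’s API.

```python
# A minimal sketch with an invented change log: before reverting a change,
# walk the recorded changes and refuse if later changes depend on it.

applied_changes = [
    {"id": "cfg-101", "depends_on": []},
    {"id": "api-102", "depends_on": ["cfg-101"]},   # built on top of cfg-101
    {"id": "ui-103",  "depends_on": ["api-102"]},
]

def dependents_of(change_id, changes):
    """Return every later change that directly or indirectly depends on change_id."""
    direct = {c["id"] for c in changes if change_id in c["depends_on"]}
    indirect = set()
    for d in direct:
        indirect |= dependents_of(d, changes)
    return direct | indirect

def safe_to_revert(change_id, changes):
    blockers = dependents_of(change_id, changes)
    if blockers:
        print(f"Reverting {change_id} would break: {sorted(blockers)}")
        return False
    return True

# Reverting the oldest change is not "free": two later changes sit on top of it.
assert not safe_to_revert("cfg-101", applied_changes)
assert safe_to_revert("ui-103", applied_changes)
```

Treating the reverse direction with this kind of rigor is exactly the “same rigor as changes forward” test described above.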

Addictive property #2 – it makes you feel good and safe.  The idea that “you are only as good as your last backup” is pretty pervasive.  So, the ability to roll something back to a ‘known good’ state gives you that warm, fuzzy feeling that it’s all OK.  Unfortunately, in large-scale situations with any significant architectural complexity, it is probably not OK.  Some dependency is almost certainly unknown, overlooked or assumed to be handled manually.  That will lead to all sorts of “not OK” when you try to roll back.  If rollback is the default contingency plan for every change you make and you don’t systematically look at other options to make sure it is the right answer, you might have a problem.
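
One way to keep rollback from being the unexamined default is to make the reverse path prove itself before it is trusted as a contingency plan.  A minimal, hypothetical sketch of that idea follows; the toy in-memory “system” stands in for real infrastructure.

```python
# A minimal sketch with hypothetical names: rollback only counts as a valid
# contingency plan if applying and then reverting returns the system to its
# prior state.

def round_trip_check(apply_change, revert_change, snapshot_state):
    """Run in a staging environment: apply, revert, and compare snapshots."""
    before = snapshot_state()
    apply_change()
    revert_change()
    return snapshot_state() == before

# Toy example: the revert forgets an overlooked piece of state.
system = {"feature_flag": False, "schema_version": 1}

def apply_change():
    system["feature_flag"] = True
    system["schema_version"] = 2

def revert_change():
    system["feature_flag"] = False
    # Forgetting schema_version is exactly the overlooked dependency that
    # makes "just roll it back" a false comfort.

print(round_trip_check(apply_change, revert_change, lambda: dict(system)))  # False
```

When a check like this fails, the honest contingency plan is a forward fix, not an ‘undo’.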

Addictive property #3 – it is easy to sell.  Management does not understand the complexity of what is required to implement a set of changes, but they do understand “Undo”.  As a result, it is trivial to convince them that everything is handled and, if there should be a problem with a change, you can just ‘back it out’.  Being able to simplify the risks to an ‘undo’ type of concept can eliminate an important checkpoint from the process.  Management falls into the all too human behavior of assuming there is an ‘undo’ for everything and stops questioning the risk management plan because they think it is structurally covered.  This leads to all sorts of ugliness should there be a problem and the expectation of an easy back-out is not met.  Does your team deliberately check its contingency plan for oversights every time, or does it assume that it will just ‘roll it back’?  If the latter, you might have a problem.

As usual, the fix for a lot of this is self-discipline.  The discipline to do things the hard and thorough way that takes just that little bit longer.  The discipline to institutionalize and reward healthy behaviors in your shop.  And, as usual, that goes against a fair bit of human nature that can be very difficult to overcome, indeed.

How a DOMO Fits

In my previous post, I discussed the notion of a DevOps Management Organization, or DOMO.  As I said there, this is an idea that is showing up under different names at shops of varying sizes.  I thought I would share a drawing of one to serve as an example.  The basic structure is, of course, a matrix organization with the ability to have each key role present within the project.  It also provides for shared infrastructure services such as support and data.  You could reasonably easily replace the Business Analyst (BA) role with a Product Owner / Product Manager role and change “Project” to “Product” and have a variant of this structure that I have seen implemented at a couple of SaaS providers around Austin.

This structure does assume a level of maturity in the organization as well as the underlying infrastructure.  It is useful to note that the platform is designated as a “DevOps Platform”.  It would probably be better to phrase that as a cloud-type platform – public, private or hybrid – where the permanence of a particular image is low, but the consistency and automation are high.  To be sure, not all environments have built such an infrastructure, but many, if not most, are building them aggressively.  The best time to look at the organization is while those infrastructures are being built and companies are looking for the best ways to exploit them.

Organization with DOMO

Rise of the DOMO

A lot of the DevOps conversations I have had lately have been around organizational issues and what to do about the artificial barriers and silos that exist in most shops.  Interestingly, there is a pattern emerging among those discussing or even implementing changes to deal with this problem.  The pattern involves matrix organizations and the rise of what I call a “DevOps Management Organization” (DOMO).  The actual names vary, but the role of the organization is consistent.

Most software delivery organizations end up with some kind of matrix where product managers, project managers, engineers, architects, and QA are all tied to the success of a given project / product while maintaining a discipline-specific tie.  In the case of an ISV, you can add some other disciplines around fulfillment, support, etc.  The variable is whether the project / product is their direct reporting organization or they report to a discipline.  And the answer can depend on the role.  For example, a Project Manager might be part of the engineering organization and be a participant in the company’s PMO.  Or they might be part of the PMO and simply assigned to (and funded by) the project / product they are working on.

When you add Agile to the mix, the project dimension tends to take on a much higher level of primacy over the disciplines, since the focus tightens on getting working software produced on a much shorter iteration timeline.  This behavior leads to the DevOps discussion as the project/product team discovers that there is no direct alignment of Operations to the project efforts.  Additionally, Operations is a varied / multi-disciplinary space in and of itself.  Thus it becomes extremely difficult for the project/product team to drive focused activity through Operations to deliver a particular iteration/release.  The classic DevOps problem.

The recent solution trend to this problem has been the creation of what I call “Ops Sherpa” roles in the project/product teams.  This role is a cross-disciplinary Ops generalist who is charged with understanding the state of the organization’s operations environment and making sure that the development effort is aligned with operational realities.  That includes full lifecycle responsibility – from ensuring that Dev and QA environments are configured as relevant equivalents of production so that deliverables are properly qualified, to making sure that the various Operations disciplines are aware of (and understand) any changes that will be required to support a particular release deliverable.  In more mature shops, this may grow out of an existing Release Management role.

Critically, though, this role gets matrixed back into the Operations organization at a high enough level to sponsor action across operational silos.  This point is what I call the Head of the DOMO, and it provides the point of leverage to deal with tactical problems as well as the strategic guidance role to drive cross-project continuous improvement into the operations platform space in support of faster execution (aka DevOps-speed releases).

Whatever the name, the fact that large-scale companies are recognizing the value of deliberately investing in this space is a validation that being good at release execution is strategic to cost-effectively shortening release cycles.

Leaders Should Make Their Teams Teach

Teaching something will make you a better practitioner of that thing.  It is an adjunct to the old adage that true mastery of a subject is the ability to teach it to someone else.  The act of educating someone on something forces you to organize your thoughts on that subject.  This, in turn, gives you new insights on the subject and/or makes you more efficient at processing the subject.  Therefore, there is a lot of value to the teacher in the act of educating.

There is a business value as well.  A lot of managers, however, are not comfortable pushing their team members to teach – particularly if they have team members who resist doing it.  I have spoken to a number of managers who really have no idea how to break down that resistance.  There is obviously the ‘stick’ side of it, where teaching should be a part of a senior person’s job description and, if they don’t do it, their review will not be as good as it could be.  But the stick side is a pretty weak instrument and it just breeds problems over the long haul.  The more important aspect is the carrot side of the discussion.

Look at it this way: no matter how good a coder someone is, their value is intrinsically limited by the amount of code they can physically produce in a certain amount of time.  That means that once they reach their personal peak productivity, they are basically plateaued in terms of their career opportunity due to laws of physics and biology (i.e. only so many hours in a day and humans need to sleep).  Contrast that with someone who has reached their productivity peak and uses their skills to help make others better.  That person is now leveraging their skills through a larger group and enhancing the productivity of that overall group by passing on lessons and learnings.  They are multiplying themselves through that group.  That person’s value has not plateaued.  This does not mean they are on a ‘management’ track, either – though it can lead there.  I have found that engineering organizations that have been successful over time all have a non-management ‘technical leader’ career path for folks who are both strong technically and effective leader/teacher/mentors.  Does your organization think like that?  It should.

Even if it doesn’t, it is irresponsible of a manager not to at least have a frank conversation about this with their team members.  A team lead does a massive disservice to a team member’s career by just encouraging them to be a ‘super techie’.  It will fundamentally limit that team member’s value and the value of the team in general.

Start Collaboration with Teaching

Every technology organization should force everyone in the group to regularly educate the group on what they are doing.  This should be a cross-discipline activity – not a departmental activity.  There are three reasons to do this.  The first is obvious – there is an intrinsic value in sharing the knowledge.  The second is that the teachers themselves get better at what they are teaching about for the reasons described above.  The third is that it serves to create relationships among the groups that will open channels of collaboration as the organization grows.

This will create more opportunities for someone to have a critical insight on a situation and invent something valuable as a result.  It may be as basic as the fact that the team is faster at solving problems because they know who to call and have a relationship with that person.  It also means that you have a better chance of keeping your ‘bus number’ at healthier levels, thereby making your organization more resilient overall.  Of course, it will also make your overall organization more cohesive, meaning people will be somewhat more likely to stay and ensuring that you have fewer ‘bus number’ situations in the first place – or at least fewer that were not caused by a bus.

Classic Metrics for How Good You Are

One of the ‘best metrics ever’ is the classic “bus number”.  This measures how many people in an organization can be hit by a bus before that organization’s operations or progress is severely hindered by their absence.  It is a slightly funny way of measuring the resilience of an organization versus the anti-pattern of hoarding knowledge in an individual’s brain.  The idea is that a resilient organization should have a very high bus number and not be vulnerable to ‘critical staff’.
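
As a rough illustration of the idea (the knowledge map below is invented, and a real measurement would need real data such as code ownership or on-call history behind it), you can estimate a bus number as the smallest group of people whose absence leaves some part of the system with nobody who knows it:

```python
# A rough sketch with an invented knowledge map; the component and people
# names are hypothetical.
from itertools import combinations

knowledge = {
    "billing-service": {"ana"},                 # only one person knows this
    "auth-service":    {"ana", "raj"},
    "deploy-pipeline": {"raj", "mei", "tom"},
}

def bus_number(knowledge):
    """Smallest number of people whose absence leaves some component with
    no one who knows it.  A bus number of 1 means one departure can hurt."""
    people = set().union(*knowledge.values())
    for k in range(1, len(people) + 1):
        for gone in combinations(people, k):
            if any(not owners - set(gone) for owners in knowledge.values()):
                return k
    return len(people)

print(bus_number(knowledge))  # 1 -- lose ana and billing-service is orphaned
```

Spreading knowledge of billing-service to even one more person raises the number, which is the whole point of the teaching habits described above.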

Think about it next time you are looking at any part of your system.  Ask yourself who you would ask about that particular module, image, or whatever.  Then ask yourself who you would go to if the first person was unavailable.  How confident are you that you would quickly / expediently get your answer?  How confident are you that you could just look the information up in a Wiki or other documentation?

If you look at the questions above and start thinking that ‘we would figure it out after a while’ or making other excuses, you have, at a minimum, a problem with communications and collaboration.  You almost certainly have a process problem.  And you may well have a cultural problem.  Make no mistake, what it means is that your team/organization/project is playing in traffic and simply waiting for the inevitable to happen.

And when something does happen, say, for example, one of the project’s “hero coders” takes a new job, it will be miserable for all who remain as they try to figure out what the hero was doing.  Meanwhile the project’s progress languishes and the deadline becomes unachievable.  Morale goes down as frustration goes up.  Maybe someone else decides to leave out of a sense of futility; making the problem worse.  And it will have been completely avoidable.  It will be completely the fault of the leadership that was either not assertive enough to make the hero share their knowledge or undisciplined enough to not include sustainability in their coaching, plans, and day-to-day execution priorities.

This is serious stuff and is worth the investment of time to solve.  The habit of focusing on the overall sustainability of the organization is something that successful, resilient organizations emphasize.  This is well documented in the classic book “Good to Great” by Jim Collins, which describes the organizations being built as sophisticated machines using the analogy of clock building.  The notion is simple, really.  The goal is to build a lasting thing that continues on as people come and go.  The project / organization must be bigger than any individual, the individuals involved must understand that, and management must encourage or enforce that mindset.  In the book, the organizations that did this radically outperformed their peers in the same markets in the same timeframe.

The reality is that you will probably always have some stuff (ideally only non-critical or very new stuff) that is not well disseminated, but view those items for what they are and triage / prioritize them so that you do not accumulate the knowledge gap as technical debt.  Or, if you do, you should do so consciously, visibly, and at a level at which you know you can tolerate the risk.