Lean Software Development and Automation

Over the past decade or more, the interest in Agile and Lean topics has grown substantially as businesses have come to see software as a key part of their brand, their customer experience, and their competitive advantage. Understanding the principles at a basic level is relatively straightforward – a testament to how well they are articulated and thought out – however, executing on them can be difficult. One of the key tools in successfully executing on the vision of Lean is exploiting the power of automation. A frequent source of confusion is that automation itself is not a goal – rather, it is a very powerful means to achieving the goal. To clarify the point that automation is a helper rather than an end, this post will look at the Principles of Lean Software Development as defined by Tom and Mary Poppendieck in their seminal work “Lean Software Development: An Agile Toolkit” (2003) and some ways automation enables them.

Principles of Lean Software Development

1 – Principle: Eliminate Waste

Tracing its way all the way back to the core of Lean manufacturing, this principle is about eliminating unnecessary work, such as fixes, infrequently used features, low-priority requirements, etc. that does not translate to actual customer value. In many ways, this principle underpins all of the others as guiding context for why they are valuable and important in their own right.

Automation carries both direct and indirect value for eliminating waste. In direct terms, it simply cuts cycle times by doing things faster relative to doing them manually. The indirect value is that speed enables the shorter feedback loops and the amplified learning that allow the team to make better decisions faster. Between the direct and indirect value, it is easy to see why there is so much focus on automation among the Lean, Agile and DevOps movements – it is at the core of waste elimination.

2 – Principle: Build Quality In

The notion of ‘building quality in’ deals with the point that it is fundamentally more efficient and cost-effective to build good code from the beginning than to try to ‘test quality in’ later. Testing late in the cycle, even though seen as a norm for years, is actually devastating to software projects. For starters, the churn of constantly fixing things is waste. Further, the chance of introducing entirely new problems (and thus extending expensive test cycles) increases with each cycle. Finally, the impact on schedules and feature work can be very damaging to customer value. There are numerous other problems as well.

The developers building the system need to have the proper facilities for ensuring that the code they have written actually meets the standard. Otherwise, they are relying on downstream tests to put the quality in and have thus violated this principle. Since the developers are human and their manual work, including any manual testing they might do, is therefore relatively slow and error-prone, automation is the only practical answer for ensuring they can validate their work while they are still working on it. This takes the form of techniques like Continuous Integration, automated tests during build cycles, and automated test suites for the integrated system once the changes have passed their unit tests. Automation provides the speed and consistency required to operate with such a high level of discipline and quality of work.

3 – Principle: Create Knowledge

This principle, sometimes written as “Amplify Learning”, addresses the point that the act of building something teaches everyone involved new ways of looking at both the original problem as well as what the best solution would be. Classic ‘omniscient specification’ at the beginning of a project carries the bizarre assumption that the specifier knows and understands all aspects of both problem and solution before writing the first word of the specification. This is obviously very unlikely at best. Lean and Agile address how the team quickly and continuously seeks out this learning, distributes the knowledge to all stakeholders, and takes action to adjust activities based on the new understanding. This behavior is one of the core maxims that really delivers the ‘agility’ in Agile.

Automation, as we have seen, provides speed and consistency that is not otherwise available. These capabilities serve to create knowledge by enabling the faster, easier collection of data. The data might be technical, in the form of the test results mentioned above as part of “Build Quality In”. A more advanced scenario might be more frequent value assessment – achieved by giving the business owners an easy facility for seeing a completed, or nearly completed, feature sooner – in order to validate the implementation before it is final. Even more advanced variants involve techniques such as “Canary deployments” or A/B testing – in which a limited audience of live customers receives early versions of features in order to analyze their response.
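
As a rough illustration of the canary idea, the sketch below assigns a limited, stable share of users to an early version of a feature. It is only a minimal sketch under stated assumptions: the function name `in_canary`, the feature name, and the user ids are all hypothetical, and a real rollout would normally rely on the organization's own feature-flag or deployment tooling.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the canary audience for a feature.

    Hashing the feature and user together gives each user a stable bucket
    from 0 to 99, so the same user gets the same answer on every request.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Hypothetical usage: expose "new-checkout" to roughly 10% of users.
users = [f"user-{n}" for n in range(1000)]
canary_users = [u for u in users if in_canary(u, "new-checkout", 10)]
print(f"{len(canary_users)} of {len(users)} users are in the canary group")
```

Because the bucket depends only on the feature and the user, raising the percentage later widens the audience without moving anyone who already has the feature back out of it.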

4 – Principle: Defer Commitment

The Defer Commitment principle addresses the point that teams would not want to take a design direction that they later learn was a fundamental ‘dead-end’. This principle is a response to the impact of knowledge creation (Principle #3 above). By delaying decisions that are hard to reverse, the team can dramatically reduce the risk of hitting a ‘dead end’ that might cause expensive rework or even kill the project.

Automation as applied to this principle also reflects the tight relationship to “Create Knowledge”. By exploiting the ability to collect more knowledge faster, and with a more complete context, teams can ensure they have the most thorough set of information possible before making a hard-to-reverse decision. Fast cycle time can also enable experimental scenarios that would not otherwise be possible. Promising architectural ideas can be prototyped in a more realistic running state because it is not too hard or time consuming to do so. This opens the team up to new and potentially better solutions, which might otherwise have been too risky. Whether about a particular feature, architectural point, or design element, automation enables the team to ensure that it has real data from many deployment and test cycles before committing.

5 – Principle: Deliver Fast

The principle of delivering fast uses the fact that short cycle times mean less waste, more customer feedback, and more learning opportunities. A fast cycle will generally have less waste because there will be less wait time and less unfinished work at any given point in time. The ability to deliver quickly also means more frequent customer feedback, which, as we have discussed, reduces waste and risk while increasing knowledge. Finally, delivering quickly will cause the team to focus on finishing a smaller number of features for each delivery rather than leaving many undone for a long period.

As has been described, speed is a common byproduct of automation. Getting working code into the hands of stakeholders is a key part of every approach to enabling business agility. In the case of applying automation’s speed to this principle, however, it is better to think in terms of frequency. Speed and frequency are closely related factors, of course – a long cycle time implies less frequent delivery and vice-versa. The point is simply that without automation, frequency will always be much lower. That means less feedback, less learning, and less knowledge for the team to use.

6 – Principle: Respect People

Beyond the obvious human aspects of this point, this principle is actually very pragmatic. The people closest to the work are the ones who know it best. They are best equipped to identify and solve challenges as they come up. Squashing their initiative to do so will diminish the team’s effectiveness and, through missed opportunities over time, the cost-effectiveness and value of the software itself.

In the previous principles, there are numerous statements about giving new capabilities to the team. This principle deals with how automation empowers the members of the team. Indeed, the alternate expression of this principle is “Empower the Team”. That phrasing gets to the crux of how automation does or does not respect the people. Automation itself cannot show respect, but how it is deployed most certainly can. For example, the contrast between a self-service facility that anyone on the team can use at any time and a similar facility for which individuals must ask permission each time they use it speaks volumes about the respect the organization has for the team’s professionalism. It will also drive behavior and self-discipline as the team matures. Consider how the practice in high-maturity Continuous Delivery scenarios has a direct, automatic path from check-in to production while so many shops still require multiple sign-offs. Which team is more likely to be effective, innovative, and efficient?

7 – Principle: Optimize the Whole

This principle really focuses on how all of the principles are interrelated across the whole lifecycle. Other management theories, such as the “Theory of Constraints”, address this with statements such as ‘you can never be faster than the slowest step’. This principle continues the theme of continuous learning and adjustment that pervades Lean thinking. It deals with the fact that in order to take time and waste out of a system, you need to understand its goal and then continuously and deliberately eliminate the root cause of the largest bottlenecks that prevent the most efficient realization of that goal.

The core of this principle is to optimize delivery of value to the customer – effectively starting with the value and working backward to the start of the process. Automation as a tool in that effort makes optimization substantially easier. When starting with a process that has manual steps, the very act of automating a process is an optimization by itself. Until the entire process flows from end to end with automation, the manual phases will be the more obvious bottlenecks. Then, once the automation spans the whole flow, the automation itself generates metrics for further improvement in cycle time and efficiency in pursuit of delivering value to the customer.
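
To make the point about automation-generated metrics a little more concrete, here is a minimal sketch that averages stage durations across a few pipeline runs and reports the slowest stage. The stage names and timings are invented for illustration; a real pipeline would export its own measurements.

```python
from statistics import mean

# Hypothetical stage timings in minutes, one dict per pipeline run.
runs = [
    {"build": 6, "unit_tests": 9, "manual_signoff": 240, "deploy": 12},
    {"build": 5, "unit_tests": 10, "manual_signoff": 180, "deploy": 11},
    {"build": 7, "unit_tests": 8, "manual_signoff": 300, "deploy": 13},
]


def average_stage_times(runs):
    """Average the duration of every stage across all recorded runs."""
    stages = runs[0].keys()
    return {stage: mean(run[stage] for run in runs) for stage in stages}


def find_bottleneck(runs):
    """Return the (stage, average_minutes) pair with the largest average."""
    averages = average_stage_times(runs)
    stage = max(averages, key=averages.get)
    return stage, averages[stage]


stage, minutes = find_bottleneck(runs)
print(f"Largest bottleneck: {stage} ({minutes:.0f} min on average)")
```

In this made-up data the manual step dominates, which mirrors the observation above that manual phases tend to be the more obvious bottlenecks until the whole flow is automated.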

That optimization effort of the last principle takes the discussion somewhat circularly back to the first principle, which is to eliminate waste. That is quite appropriate. Given how interrelated all of these principles are, the discussion of the contributions of automation to them should be similarly interrelated.

This post is also on LinkedIn here: https://www.linkedin.com/pulse/lean-software-development-automation-dan-zentgraf

DevOps is about Building Fords, not Ferraris

There is an interesting obsession with having the ‘ultimate’ of whatever you’re talking about. This applies to most things in our society: jobs, houses, televisions, cars. You name it, there is an ‘ultimate’ version that everyone aspires to have. There is a lot of good to this behavior, to be sure. I believe strongly that everyone should be trying to get better all the time. Though I would point out that it is healthier to regard the ultimate [whatever] as a consequence or benefit of getting better rather than an end unto itself.

But it’s usually bad to want the ‘ultimate’ in your software delivery process. Goldplating has always been an enemy in software projects and there is evidence of it in how a lot of organizations have traditionally delivered software. It usually shows up in the culture, where high-intervention processes lead to hero cults and aspirations to be the ultimate ‘hero’ who gets releases out the door. Old-school, old-world hand craftsmanship is the order of the day. DevOps is the exact opposite of this approach. It focuses on a highly repeatable, scalable, and mass-produced approach to releasing software. And frequently.

Which brings me back to the contrast between a Ferrari and a Ford. A Ferrari is pretty much the ultimate sports car and ultimate sports car brand. There really is very little not to like. But the cars are exotics still built with expensive materials using manual, old world techniques. To be fair, Ferrari has a super-modern robotic process for a lot of their precision work, but they add a lot of customization and hand-finishing. And they ship only a few thousand releases (cars) each year. Sustaining such a car in the real world involves specially trained mechanics named Giuseppe, long waits for parts from Italy, and even shipping the car across the state if you don’t live close to a qualified shop. No biggie – if you can afford the car, you can afford the maintenance. But, let’s face it, they are a ‘money is no object’ accessory.

Ford has shipped a variety of performance models over the years based on the Mustang platform. In fact, there have been years where Ford has shipped more performance Mustangs in a week than Ferrari would ship cars in that YEAR. And there is a magic there for a DevOps geek. Plain ol’ Ford Motor Company has started selling a 200mph Mustang this year for about $60K. There’s nothing too exotic about it. You can go to your local Ford dealer and buy it. It can be purchased at one dealer and serviced at any other dealer anywhere in the country. Parts? No problem – most of them are in local warehouses stationed strategically so that no dealer would have to keep a customer waiting too long for common items. A lot of stuff can be had from your local AutoZone because, well, it’s “just” a Mustang.

The lesson, though, is that Ford has an economy of scale by virtue of the volume of Mustangs it produces. No, a Mustang is not as nice or as custom as a Ferrari. It is as common and mass-produced as anything. But a 200mph car that anyone can buy for noticeably less than a house, get parts easily, and have serviced at thousands of locations is an amazing and magical thing. It teaches a solid lesson about scalability and sustainability that should be inspirational for DevOps teams.

And maybe, just maybe, if your company does a good enough job at sustainably delivering your software, you might be able to afford that Ferrari someday…

PS – for Chevy zealots. I realize the Corvette cleared 200 on a “volume” platform first. But the 200mph Plastic Fantastic looks more exotic relative to the Mustang – which has a plain “sporty commuter” or even rental fleet version with a V6. And the economies of scale mean that the 200mph Shelby Mustang is still a bargain relative to the 200mph-capable ‘vette, which is the point of this post.

A System for Changing Systems – Part 9

The last capability area in the framework is that of Monitoring. I saved this for last because it is the one that tends to be the most difficult to get right. Of course, commensurate with the difficulty is the benefit gained when it is working properly. A lot of the difficulty and benefit with Monitoring comes from the fact that knowing what to look at, when to look at it and what NOT to look at are only the first steps. It also becomes important to know what distributed tidbits of information to bring together if you actually want a complete picture of your application environment.

Monitoring

Monitoring Capability Area

This post could go for pages – and Monitoring is likely going to be a consuming topic as this series progresses – but for the sake of introduction, let’s look at the Monitoring capability area. The sub-capabilities for this area encompass the traditional basics of monitoring, Events and Trends among them. The challenge for these two is in figuring out which Events to monitor and sometimes how to get the Event data in the first place. The Trends must then be put into a Report format that resonates with management. It is important to invest in this area in order to build trust with management that the team has control as it tries to increase the frequency of changes – without management’s buy-in, they won’t fund the effort. Finally, the Correlation sub-capability area is related to learning about the application system’s behavior and how changes to some part of the system impact the other parts. This is an observational knowledge base that must be deliberately built by the team over time so that they can put the Events, Trends, and Reports into the most useful contexts and use the information to better understand risks and priorities when making changes to the system.
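
One very simple way to start building that Correlation knowledge is to line up monitored Events against the changes that preceded them. The sketch below does only that, and everything in it is an assumption for illustration: the change names, event names, timestamps, and the 30-minute window are all hypothetical, and events appearing shortly after a change suggest a place to look rather than prove a cause.

```python
from datetime import datetime, timedelta

# Hypothetical changes and monitored events, each with a timestamp.
changes = [
    ("schema-58 applied", datetime(2024, 1, 10, 14, 0)),
    ("app-2.4.1 deployed", datetime(2024, 1, 11, 9, 30)),
]
events = [
    ("slow query alert", datetime(2024, 1, 10, 14, 12)),
    ("disk usage warning", datetime(2024, 1, 10, 20, 45)),
    ("error rate spike", datetime(2024, 1, 11, 9, 41)),
]


def correlate(changes, events, window=timedelta(minutes=30)):
    """Pair each change with the events seen within `window` after it."""
    result = {}
    for change_name, changed_at in changes:
        result[change_name] = [
            event_name
            for event_name, seen_at in events
            if changed_at <= seen_at <= changed_at + window
        ]
    return result


for change_name, related in correlate(changes, events).items():
    print(f"{change_name}: {related}")
```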

A System for Changing Systems – Part 8

The fourth capability area is that of Provisioning. It covers the group of activities for creating all or part of an environment in which an application system can run. This is a key capability for ensuring that application systems have the capacity they need to maintain performance and availability. It is also crucial for ensuring that development and test activities have the capacity they need to maintain THEIR performance. The variance with test teams is that a strong Provisioning capability also ensures that development and test teams can have clean dev/test environments that are very representative of production environments and can very quickly refresh those dev/test environments as needed. The sub-capabilities here deal with managing the consistency of environment configurations, and then quickly building environments to a known state.

Provisioning Capability Area

The fifth capability area is closely related to Provisioning. It is the notion of a System Registry capability. This set of capabilities deals with delivering the assumed infrastructure functions (e.g. DNS, e-mail relays, IP ranges, LDAP, etc.) that surround the environments. These capabilities must be managed in such a way that one or more changes to an application system can be added to a new or existing environment without significant effort or disruption. In many ways this capability area is the fabric in which the others operate. It can also be tricky to get right because this capability area often spans multiple application systems.

System Registry Capability Area

A System for Changing Systems – Part 7 – Deployment Capabilities

The third capability area is that of Deployment. Deployment deals with the act of actually putting the changes into a given target environment. It is not prescriptive of how this happens. Many shops mechanically deal with deployment via their provisioning system. That is obviously a good thing and an efficiency gain by removing a discrete system for performing deployment activities. It is really a best practice of the most mature organizations. However, this taxonomy model is about identifying the capabilities needed to consistently apply changes to a whole application system. And, let’s face it, best practices tend to be transient, as new, even better, best practices emerge.

Deployment Capability Area

Additionally, there are a number of reasons the capability is included in this taxonomy. First of all, the framework is about capabilities rather than technologies or implementations. It is important to be deliberate about how changes are deployed to all environments. Simply because some group of those changes is handled by a provisioning tool does not remove the fact that not all are covered, nor does it remove the fact that some deliberate work is expended in fitting the changes into the provisioning tool’s structure. Most provisioning tools, for example, are set up to handle standard package mechanisms such as RPM. The deployment activity in that scenario is more one of packaging the custom changes. But the provisioning answer is not necessarily a solution for all four core areas of an application system, so there needs to be a capability that deals generically with all of them. Finally, many, if not most, shops have some number of systems where there are legacy technical requirements that require deployment to happen separately.

All of that being true, the term “Deployment” is probably confusing given its history and popular use. It will likely be replaced in the third revision of this taxonomy with something more generic, such as “Change Delivery”.

The sub-category of Asset Repository refers to the fact that there needs to be an ability to maintain a collection of changes that can be applied singly or in bulk to a given application system. In the third revision of the taxonomy, it is likely to be joined by a Packaging sub-capability.  Comments and thoughts are welcome as this taxonomy is evolving and maturing along with the DevOps movement.

A System for Changing Systems – Part 6 – Change Management and Orchestration Capabilities

This post covers the first two capability areas in the system taxonomy. This discussion will begin with where the changes come into the “system for changing systems”, Change Management, and proceed around the picture of top-level capability areas.

The first capability area to look at is Change Management. Change is the fundamental reason for this discussion and, in many ways, the discussion is pointless unless this capability is well understood. Put more simply, you can not apply changes if you do not know what the changes are. As a result, this capability area is the change injector for the system. It is where changes to the four components of the application system are identified, labeled and tracked as they are put into place in each environment. For convenience and in recognition of the fact that changes are injected from both the “new feature” angle as well as from the “maintenance item” angle, the two sources of change are each given their own capability sub-area.

Change Management Capability Area
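
As a minimal sketch of what "identified, labeled and tracked" could look like in practice, the example below records each change with its source (new feature or maintenance item), the component of the application system it touches, and the environments where it has been applied. The class, field names, change ids, and environment names are all hypothetical and only meant to illustrate the idea, not to describe any particular tool.

```python
from dataclasses import dataclass, field

SOURCES = {"new_feature", "maintenance_item"}
COMPONENTS = {
    "application_code",
    "database",
    "server_infrastructure",
    "network_infrastructure",
}


@dataclass
class Change:
    """One identified, labeled change to a component of the application system."""

    change_id: str
    source: str
    component: str
    applied_to: list = field(default_factory=list)

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.component not in COMPONENTS:
            raise ValueError(f"unknown component: {self.component}")

    def record_applied(self, environment: str) -> None:
        """Track that this change has been put into place in an environment."""
        if environment not in self.applied_to:
            self.applied_to.append(environment)


def pending_for(changes, environment):
    """List the ids of changes not yet applied to the given environment."""
    return [c.change_id for c in changes if environment not in c.applied_to]


# Hypothetical changes moving through hypothetical environments.
changes = [
    Change("CHG-101", "new_feature", "application_code"),
    Change("CHG-102", "maintenance_item", "database"),
]
changes[0].record_applied("test")
changes[0].record_applied("production")
changes[1].record_applied("test")

print("Pending for production:", pending_for(changes, "production"))
```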

The second capability area is that of Orchestration. In a complex system that is maintained by a combination of human and machine-automated processes, understanding what is done, by whom, and in what order is important. This capability area has two sub-areas – one for the technical side and one for the people. This reflects the need to keep the technical dependencies properly managed and also to keep everyone on the same page. Orchestration is a logical extension of the changes themselves. Once you know what the changes are, everyone and everything must stay synchronized on when and where those changes are applied to the application system.

Orchestration Capability Area
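
The technical side of that ordering problem can be sketched as a dependency graph. The example below uses Python's standard `graphlib.TopologicalSorter` (available in Python 3.9 and later) to produce one valid order; the step names and their dependencies are hypothetical, and a real orchestration capability would also have to cover the people side, such as notifications and approvals.

```python
from graphlib import TopologicalSorter

# Hypothetical change steps, each mapped to the steps it depends on.
dependencies = {
    "update_network_rules": set(),
    "patch_server_os": set(),
    "migrate_database_schema": {"patch_server_os"},
    "deploy_application_code": {"migrate_database_schema", "update_network_rules"},
    "notify_stakeholders": {"deploy_application_code"},
}

# static_order() yields every step after all of the steps it depends on.
order = list(TopologicalSorter(dependencies).static_order())
for position, step in enumerate(order, start=1):
    print(f"{position}. {step}")
```

Steps with no dependency between them may come out in either order, which is a reminder that the graph only captures the constraints the team has written down.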

A System for Changing Systems – Part 5 – Top-level Categories

The first step to understanding the framework is to define the broad, top level capability areas. A very common problem in technology is the frequent over-use of terms that can have radically different meanings depending on the context of a conversation. So, as with any effort to clarify the discussion of a topic, it is very critical to define terms and hold to those definitions during the course of the discussion.

Top level categories of capabilities around various environments in which applications typically must run.

Top level capability areas for sustaining application systems across environments.

At the top level of this framework are six capability groupings:

  • Change Management – This category is for capabilities that ensure that changes to the system are properly understood and tracked as they happen. This is a massively overused term, but the main idea for this framework is that managing changes is not the same thing as applying them. Other capabilities deal with that. This capability category is all about oversight.
  • Orchestration – This category deals with the ability to coordinate activity across different components, areas, and technologies in a complex distributed application system in a synchronized manner.
  • Deployment – This category covers the activities related to managing the lifecycles of an application system’s artifacts through the various environments. Put more simply, this area deals with the mechanics of actually changing out pieces of an application system.
  • Monitoring – The monitoring category deals with instrumenting the environment for various purposes. This instrumentation concept covers all pieces of the application system and provides feedback in the appropriate manner for interested stakeholders. For example, capacity usage for operations and feature usage for development.
  • System Registry – This refers to the need for a flexible and well-understood repository of shared information about the infrastructure in which the application system runs. This deals with the services on which the application system depends and which may need to be updated before a new instance of the application system can operate correctly.
  • Provisioning – This capability is about creating and allocating the appropriate infrastructure resources for an instance of the application system to run properly. This deals with the number and configuration of those resources. While this area is related to deployment, it is separate because in many infrastructures it may not be desirable or even technically possible to provision fresh resources with each deployment, and linking the two would blunt the relevancy of the framework.

The next few posts will dig into the sub-categories underneath each of these top-level items.

A System for Changing Systems – Part 3 – How Many “Chang-ee”s

As mentioned in the last post, once there is a “whole system” understanding of an application system, the next problem is that there are really multiple variants of that system running within the organization at any given time. There are notionally at least three: Development, Test, and Production. In reality, however, most shops frequently have multiple levels of test and potentially more than one Development variant. Some even have Staging or “Pre-production” areas very late in test where the modified system must run for some period before finally replacing the production environment. A lot of this environment proliferation is based on historic processes that are themselves a product of the available tooling and lessons organizations have learned over years of delivering software.

Example Environment Flow

This is a simplified, real-world example flow through some typical environments. Note the potential variable paths – another reason to know what configuration is being tested.

Tooling and processes are constantly evolving. The DevOps movement is really a reflection of the mainstreaming of Agile approaches and cloud-related technologies and is ultimately a discussion of how to best exploit it. That discussion, as it applies to environment proliferation, means we need to get to an understanding of the core problems we are trying to solve. The two main problem areas are maintaining the validity of the sub-production environments as representative of production and tracking the groupings of changes to the system in each of the environments.

The first problem area, that of maintaining the validity of sub-production environments, is a more complex problem than it would seem. There are organizational silo problems where multiple different groups own the different environments. For example, a QA group may own the lab configurations and therefore have a disconnect relative to the production team. There are also multipliers associated with technical specialities, such as DBAs or Network Administration, which may be shared across some levels of environment. And if the complexity of the organization were not enough, there are other issues associated with teams that do not get along well, the business’ perception that test environments are less critical than production, and other organizational dynamics that make it that much more difficult to ensure good testing regimes are part of the process.

The second key problem area that must be addressed is tracking the groups of changes to the application system that are being evaluated in a particular sub-production environment. This means having a unique identifier for the combination of application code, the database schema and dataset, system configuration, and network configuration. That translates to five version markers – one for each of the main areas of the application system plus one for the particular combination of all four. On the surface, this is straightforward, but in most shops, there are few facilities for tracking versions of configurations outside of software code. Even when there are, they are too often not connected to one another for tracking groupings of configurations.
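
The five version markers can be sketched quite compactly. In the example below, four fields hold the version of each area and a fifth marker is derived from the combination of all four. Deriving it with a hash is just one possible approach chosen for this illustration, and the class name, field names, and version labels are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfiguration:
    """Version markers for the four areas of an application system."""

    application_code: str
    database: str
    server_infrastructure: str
    network_infrastructure: str

    @property
    def combined_version(self) -> str:
        """Fifth marker: a short identifier for this exact combination."""
        parts = "|".join(
            [
                self.application_code,
                self.database,
                self.server_infrastructure,
                self.network_infrastructure,
            ]
        )
        return hashlib.sha256(parts.encode("utf-8")).hexdigest()[:12]


# Hypothetical version labels for two environments.
test_env = SystemConfiguration("app-2.4.1", "schema-57", "server-19", "net-8")
prod_env = SystemConfiguration("app-2.4.0", "schema-57", "server-19", "net-8")

print("test:", test_env.combined_version)
print("prod:", prod_env.combined_version)
print("same configuration:", test_env.combined_version == prod_env.combined_version)
```

Comparing the combined marker answers the question "is this environment running the same thing as that one?" in a single check, while the four individual markers show where the two differ.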

The typical pattern for solving these two problems actually begins with the second problem first. It is difficult to ensure the validity of a test environment if there is no easy way to identify and understand the configuration of the components involved. This is why many DevOps initiatives start with configuration management tools such as Puppet, Chef, or VMware vCenter. It is also why “all-in-one” solutions such as IBM’s Pure family are starting to enter the market. Once an organization can get a handle on their configurations, then it is substantially easier to have fact-based engineering conversations about valid test configurations and environments because everyone involved has a clear reference for understanding exactly what is being discussed.

This problem discussion glosses over the important aspect of being able to maintain these tools and environments over time. Consistently applying the groups of changes to the various environments requires a complex system by itself. The term system is most appropriate because the needed capabilities go well beyond the scope of a single tool and then those capabilities need to be available for each of the system components. Any discussion of such broad capabilities is well beyond the scope of a single blog post, so the next several posts in this series will look at a framework for understanding the capabilities needed for such a system.

A System for Changing Systems – Part 2 – The “Chang-ee”

As discussed last time, having a clear understanding of the thing being changed is key to understanding how to change it. Given that, this post will focus on creating a common framework for understanding the “Change-ee” systems. To be clear, the primary subject of this discussion is software application systems. That should be obvious from the DevOps discussion, but I prefer not to assume things.

Application systems generally have four main types of components. First, and most obviously, is the software code. That is often referred to as the “application”. However, as the DevOps movement has long held, that is a rather narrow definition of things. The software code cannot run by itself in a standalone vacuum. That is why these posts refer to an application *system* rather than just an application. The other three parts of the equation are the database, the server infrastructure and the network infrastructure. It takes all four of these areas working together for an application system to function.

Since these four areas will frame the discussion going forward, we need to have a common understanding about what is in each. It is important to understand that there are variants of each of these components as changes are applied and qualified for use in the production environment. In other words, there will be sub-production environments that have to have representative configurations. And those have to be considered when deciding how to apply changes through the environment.

  • Application Code – This is the set of functionality defined by the business case that justifies the existence of the application system in the first place and consists of the artifacts created by the development team for the solution including things such as server code, user interface artifacts, business rules, etc.
  • Database & Data – This is the data structure required for the application to run. This area includes all data-related artifacts, whether they are associated with a traditional RDBMS, “no sql” system, or just flat files. This includes data, data definition structures (eg schema), test datasets, and so forth.
  • Server Infrastructure (OS, VM, Middleware, Storage) – This represents the services and libraries required for the application to run. A broad category ranging from the VM/OS layer all the way through the various middleware layers and libraries on which the application depends. This area also includes storage for the database area.
  • Network Infrastructure – This category is for all of the inter-system communications components and links required for users to derive value from the application system. This includes the connectivity to the users, connectivity among servers, connectivity to resources (e.g. storage), and the devices (e.g. load balancers, routers, etc.) that enable the application system to meet its functional, performance, and availability requirements.

Application System Components

Conceptual image of the main system component areas that need to be in sync in order for a system to operate correctly

The complicating factor for these four areas is that there are multiple instances of each of them that exist in an organization at any given time. And those multiple instances may be at different revision levels. Dealing with that is a discussion unto itself, but is no less critical to understanding the requirements for a system to manage your application system. The next post will examine this aspect of things and the challenges associated with it.

A System for Changing Systems – Part 1 – Approach

This is the first post in a series that will look at common patterns among DevOps environments. Based on these patterns, the posts will attempt to put a reasonable structure together that will help organizations focus DevOps discussions, prioritize choices, and generally improve how they operate.

In the last post, I discussed how many shops take the perspective of developing a system for DevOps within their environments.  This notion of a “system for changing systems” as a practical way of approaching DevOps requires two pieces.  The first is the system being changed – the “change-ee” system.  The second is the system doing the changing – the “DevOps”, or “change-er” system.  Before talking about automatically changing something, it is necessary to have a consistent understanding of the thing being changed.  Put another way, no automation can operate well without a deep understanding of the thing being automated.  So this first post is about establishing a common language for generically understanding the application systems; the “change-ee” systems in the discussion.

A note on products, technologies and tools…  Given the variances in architectures for application (“change-ee”) systems, and therefore the implied variances on the systems that apply changes to them, it is not useful to get product prescriptive for either.  In fact, a key goal with this framework is to ensure that it is as broadly applicable and useful as possible when solving DevOps-related problems in any environment.  That would be very difficult if it overly focused on any one technology stack.  So, these posts will not necessarily name names other than to use them as examples of categories of tools and technologies.

With these things in mind, these posts will progress from the inside-out.  The next post will begin the process with a look at the typical components in an application system (“change-ee”).  From there, the next set of posts will discuss the capabilities needed to systematically apply changes to these systems.  Finally, after the structure is completed, the last set of posts will look at the typical progression of how organizations build these capabilities.

The next post will dive in and start looking at the structure of the “change-ee” environment.