From time to time I’ve consulted on or been responsible for managing processes that couldn’t ever go wrong. A process failure could have had catastrophic safety or economic consequences. A lot of the time, the critical issues had to do with technology, but plenty also depended on human beings operating near to perfection — and that’s really tough to achieve.
If you look at the history of failures in any process or organization, you tend to see a number of things happening over and over again:
- There is a single point of failure in the process design, often some aspect of technology or a single critical decision-maker needed to handle exceptions.
- There are operational pressures that cause even skilled and well-motivated employees or contractors to make errors, usually because they don’t follow the defined process closely enough.
- There are people in roles for which they are not suited, who cannot meet performance standards for the work assigned to them.
In my by-now-extensive experience, these three root causes together account for well over 90 percent of all failures and service interruptions. If you can’t afford any failures at all, you have to address each of them — and do so within reasonable financial and behavioral constraints. Here’s my approach.
The technology side is relatively easy, although not cheap. High-availability (HA) designs combine some level of segmentation (like 10 units, each with 10 percent of the total required capacity) with some level of redundancy (two additional units: one able to take over if a unit fails for any reason, and a further spare in case a second failure occurs before the first can be remediated). That combination will generally give you around 99.99 percent availability.
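If it helps to see the arithmetic, here is a rough sketch in Python of why an N+2 arrangement like that lands you in four-nines territory. It treats units as independent and ignores switchover time and common-mode failures, and the 99.5 percent per-unit availability is an illustrative assumption, not a measured figure.

```python
from math import comb

def system_availability(needed: int, total: int, unit_availability: float) -> float:
    """Probability that at least `needed` of `total` independent units are up."""
    return sum(
        comb(total, up)
        * unit_availability ** up
        * (1 - unit_availability) ** (total - up)
        for up in range(needed, total + 1)
    )

# 10 working units plus 2 spares, as in the N+2 design described above.
# The 99.5 percent per-unit availability is an illustrative assumption.
print(f"{system_availability(needed=10, total=12, unit_availability=0.995):.5%}")
```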
You can still be hurt by “cascading failures,” where a failed unit triggers overloads and consequent failures in adjacent units before a spare can come into the capacity pool. That means you have to design in “graceful degradation” modes that accept a short-term reduction in performance or quality of service rather than a complete service interruption — think of this as the equivalent of “brownouts” vs. blackouts in the power grid.
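To make graceful degradation a little more concrete, here is a minimal sketch, in Python, of a priority-based load shedder: when capacity drops, lower-priority work is dropped so that customer-facing work keeps running. The request names, priorities and capacity figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    priority: int  # lower number = more important
    cost: int      # capacity units this request consumes

def shed_load(requests: list[Request], available_capacity: int) -> list[Request]:
    """Serve the most important work first; drop the rest instead of failing outright."""
    served, used = [], 0
    for req in sorted(requests, key=lambda r: r.priority):
        if used + req.cost <= available_capacity:
            served.append(req)
            used += req.cost
    return served

# Hypothetical workload: with two units down, capacity falls and the low-priority
# reporting job is shed while customer-facing work continues (a "brownout").
workload = [Request("checkout", 1, 4), Request("search", 2, 3), Request("nightly-report", 9, 5)]
print([r.name for r in shed_load(workload, available_capacity=8)])
```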
A good HA design has no single points of failure, adequate (but not excessive) spare capacity, and automatic switchover to a ready spare when a failure occurs. These additional design elements get you to 99.999 percent (the fabled “five nines”). That’s still equivalent to about five minutes a year of “incident time,” which is why the graceful degradation design factor comes into play. Add that in and you get at least another “nine” — roughly 30 seconds a year of potential trouble. For most business processes, assuming you can recover quickly, that’s good enough.
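For reference, the “nines” arithmetic is just a conversion from availability to allowable downtime per year; a quick sketch:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines, availability in [(4, 0.9999), (5, 0.99999), (6, 0.999999)]:
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines: about {downtime_minutes:.1f} minutes of downtime a year")

# 4 nines: about 52.6 minutes a year
# 5 nines: about 5.3 minutes a year (the "five minutes" above)
# 6 nines: about 0.5 minutes, i.e. roughly 30 seconds a year
```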
Although I’ve described this in terms that tend to imply technology, the design principles apply equally to people-oriented tasks. Supervisors can step in to relieve workers; lower-priority work can be stopped to focus on reducing customer queues; managers can step in for supervisors; and so on. We still need to design the work so that this happens seamlessly, with all that implies for training, opportunities to practice, and attitude among the workforce.
Operational pressures need a different strategy. What should be one of the most dangerous workplaces on Earth, but isn’t, is the deck of an aircraft carrier during flight operations. There actually are few accidents and injuries, because process, role definition, training and — critically — well-designed checklists drive everything that anyone does.
Checklists show up in many places where there are critical risks associated with a mistake or misstep, but they can be used in far more routine situations too. They have the advantage that (almost) anyone can perform the work if they have the requisite basic skills. The expert doesn’t always have to be there. (An excellent read on this is Atul Gawande’s “The Checklist Manifesto.”) Really good checklists include what to do when the process doesn’t work as it should — in process-design lingo, that’s called error and exception handling.
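One way to think about a checklist with built-in exception handling is as an ordered list of checks, each carrying an instruction for what to do if it fails. Here is a minimal Python sketch along those lines; the steps and handlers are hypothetical, not drawn from any real checklist.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    description: str
    check: Callable[[], bool]  # returns True if the step passed
    on_failure: str            # the exception-handling instruction

def run_checklist(steps: list[Step]) -> bool:
    """Walk the checklist in order; stop and escalate at the first failed step."""
    for step in steps:
        if step.check():
            print(f"OK   {step.description}")
        else:
            print(f"FAIL {step.description} -> {step.on_failure}")
            return False
    return True

# Hypothetical preflight-style checklist for a routine IT change.
checklist = [
    Step("Backup completed within the last 24 hours", lambda: True,
         "Stop. Run a fresh backup before proceeding."),
    Step("Rollback plan reviewed and approved", lambda: False,
         "Stop. Escalate to the change manager; do not start the change."),
]
run_checklist(checklist)
```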
In essence, a checklist is just a form of process-design documentation, aimed at the people who will do the work. A good friend of mine who was for years a check captain for a major airline told me he could preflight an airplane in his sleep, as he had tens of thousands of flying hours and thousands of takeoffs, but he’d never try it without a checklist. He’d probably never make a mistake, but the consequences if he did weren’t worth it. He never felt that using the checklist demeaned his skills as a pilot; in fact it made him better at his job (which entailed much more than just flying the airplane).
Our final challenge focuses on the people who are in the wrong role for their skills, attitudes and motivations. Here I want to share an idea I have implemented on several occasions when root-cause analysis indicated a persistent problem: a small proportion of employees causing most of the incidents through carelessness or lack of ability.
In each case we implemented a program of team-level performance bonuses, earned for periods without any customer-impacting service interruptions. The employee teams were given three months to fix anything they felt was an impediment to perfect service delivery (old equipment, lack of documentation, inadequate training and so on), before the clock started. Then, for every month of interruption-free service delivery the team accrued a (modest) bonus. After three continuous months, the accrued bonus was paid out and the process restarted. However, if there was an interruption, all accrued bonuses were forfeit and the clock reset. There were additional bonuses for 6 and 12 months of continuous interruption-free performance.
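The accrual rules are mechanical enough to write down as code. Here is a sketch of the scheme as described, with the monthly bonus amount as a placeholder parameter; the additional 6- and 12-month bonuses are left out for brevity.

```python
def accrued_payouts(months: list[bool], monthly_bonus: float = 100.0) -> float:
    """months[i] is True if month i had no customer-impacting interruption.

    Bonuses accrue each clean month, pay out after every three continuous clean
    months, and are forfeited (with the clock reset) by any interruption.
    """
    paid, streak, accrued = 0.0, 0, 0.0
    for clean in months:
        if clean:
            streak += 1
            accrued += monthly_bonus
            if streak % 3 == 0:      # payout after three continuous clean months
                paid += accrued
                accrued = 0.0
        else:                        # interruption: forfeit accruals, reset the clock
            streak, accrued = 0, 0.0
    return paid

# Example: two clean months, an interruption, then six clean months -> two payouts.
print(accrued_payouts([True, True, False, True, True, True, True, True, True]))
```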
The teams quickly became self-policing. They all knew who the poor performers were, and over a remarkably short period they either coached them to do better or eased them out of the organization. In every case, after 6 to 12 months, almost every employee-caused service interruption had been eliminated.
The bonus added about 15 percent to operational costs, but service interruptions had been much more expensive than that, so overall quality was way up and costs were down. You tend to get what you measure, so by measuring (and rewarding) what we wanted we were able to virtually eliminate service interruptions. The strategy of self-regulation became self-sustaining. New team members were carefully supervised until they were fully integrated into the process. Improvement suggestions and risk-mitigation ideas appeared regularly. The teams really wanted to do a perfect job and became intolerant of anything that prevented them from doing so.
Taken together, these three strategies — HA design, checklists, and team performance bonuses supplemented by peer pressure — deliver an operational approach to routine excellence that I call “Brilliant at the Basics” (I “borrowed” that from Roy Dunbar, the former president of MasterCard’s technology and operations group; thanks, Roy). It doesn’t work for everything, but when you manage a process that can’t ever fail, it’s a great place to start.
John Parkinson is an affiliate partner at Waterstone Management Group in Chicago. He has been a global business and technology executive and a strategist for more than 35 years.