[Image: two chess pieces on a light background, a black king standing upright beside a toppled white king, a symbol of victory and defeat.]

According to a 2017 report by the Project Management Institute (PMI), 14 percent of all IT projects fail outright. Of the remaining projects, 31 percent do not achieve all of their goals, 43 percent exceed the planned budget, and 49 percent deliver late. What went badly in a failed project, or what went well in a successful one, usually remains hidden from view, reserved for those directly involved.

If not only the individual developer, the development team, or the company concerned, but our industry as a whole is to learn from project experience, IT projects must be discussed publicly. Yet such discussion is the exception and happens far too rarely: when a $655 million space probe is lost, for example, or when a Fortune 500 company (Hertz) sues another Fortune 500 company (Accenture) over a failed IT project.

NASA's Mars Climate Orbiter

NASA's Mars Climate Orbiter (MCO) mission is an example of an IT project whose failure is well documented. The probe, intended to study the climate of Mars, was lost on September 23, 1999, due to a unit error in the navigation software. Just one week later, on September 30, 1999, NASA made the cause public: part of the team had calculated using inches, feet, and pounds, while the rest had used metric units. One statement in that announcement is remarkable:

The problem here was not the error, it was the failure of NASA's systems engineering, and the checks and balances in our processes to detect the error.

The technical error that led to the loss of the probe is therefore not seen as the underlying cause, but as a consequence of shortcomings in the development process: too little communication between the teams and a lack of integration tests.
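The lesson generalizes: any interface that passes around a bare number invites exactly this kind of mismatch. As a minimal sketch (in Python, with invented names, and no relation to the actual MCO flight software), a component boundary can make units explicit and refuse to guess:

```python
# Sketch: normalize an impulse value at a component boundary instead of
# trusting that both sides agree on the unit. All names are invented
# for illustration.

LBF_S_TO_N_S = 4.44822  # one pound-force second, expressed in newton seconds


def impulse_in_newton_seconds(value: float, unit: str) -> float:
    """Convert an impulse to newton seconds; reject unknown units loudly."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {unit!r}")


print(impulse_in_newton_seconds(10.0, "lbf*s"))
print(impulse_in_newton_seconds(5.0, "N*s"))
```

A check like this would not have fixed the process shortcomings NASA identified, but it illustrates where an integration test between the two teams' components could have caught the error.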

Boeing 737 MAX 8

In October 2018, a Boeing 737 MAX 8 crashed shortly after take-off in Indonesia (Lion Air Flight 610). In March 2019, an aircraft of the same type crashed just minutes after take-off in Ethiopia (Ethiopian Airlines Flight 302). Both crashes killed everyone on board, and in both cases the software known as the Maneuvering Characteristics Augmentation System (MCAS) caused the crash.

Robert C. Martin, known to software developers as "Uncle Bob" and author of "Clean Code", analyzed in an article what is known about the MCAS software problems. His perspective is not limited to that of a software development expert. As a pilot, he also knows the context in which the software in question operates. As far as is known, in both cases the MCAS was supplied with faulty data from a single angle-of-attack sensor. Based on this data, the MCAS then wrested control of the aircraft from the pilots. The software relied on the readings from this one sensor without cross-checking them against other data such as airspeed, vertical speed, or altitude — or against the data from a second angle-of-attack sensor. Such cross-checking becomes second nature to a pilot during instrument flight training. After all, an instrument can fail in a way that is not immediately apparent. In his article, Robert C. Martin asks why the software developers did not take these basic principles of aviation into account.
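The cross-checking Martin describes can be illustrated in a few lines. This is a hypothetical sketch, not the actual MCAS logic; all names and the disagreement threshold are invented for illustration:

```python
# Sketch: only allow an automatic system to act when redundant sensor
# readings roughly agree; otherwise leave control with the pilots.
# Hypothetical names and tolerance, invented for illustration.

def aoa_plausible(aoa_left_deg: float, aoa_right_deg: float,
                  max_disagreement_deg: float = 5.0) -> bool:
    """Return True only if both angle-of-attack sensors roughly agree."""
    return abs(aoa_left_deg - aoa_right_deg) <= max_disagreement_deg


def automatic_trim_may_engage(aoa_left_deg: float, aoa_right_deg: float) -> bool:
    # Engage automatic trim only when the redundant readings are consistent.
    # A real system would also cross-check airspeed, vertical speed, and
    # altitude, as Martin suggests.
    return aoa_plausible(aoa_left_deg, aoa_right_deg)


print(automatic_trim_may_engage(12.0, 12.8))   # sensors agree
print(automatic_trim_may_engage(12.0, 74.5))   # implausible disagreement
```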

Hertz vs. Accenture

In April 2019, the lawsuit filed by US car rental company Hertz against its IT service provider Accenture attracted considerable attention. The complaint, which became public, contains many details that offer deep insights into a failed IT project. In August 2016, Hertz commissioned Accenture to develop a new online presence, which was to go live in December 2017. That first deadline was missed, as were a second (January 2018) and a third (April 2018). Hertz sued Accenture for $32 million in fees already paid, plus further millions needed to clean up the mess left behind. According to Hertz, Accenture delivered neither a working app nor a working website. Although Hertz required a solution usable for the Dollar and Thrifty brands and in markets outside North America, Accenture developed software that worked only for the Hertz brand and only in North America — and even that was never completed.

It may be tempting to gloat over the fact that even large companies like Hertz and Accenture can fail spectacularly at a project. But what makes the failure of this project unusual is simply that so much went wrong and, thanks to the lawsuit, all of it came to light publicly.

Accenture did not test the developed software, at least not thoroughly or in time. The statement of claim suggests that neither automated tests in general nor test-driven development in particular were used. The missing automated tests cannot easily be added retrospectively, because it is not sufficiently clear what the code is actually meant to do. Why would it not be clear what the code under test exists for? That can be read between the lines of the description of other project problems. For instance, the Accenture developers were unable to integrate the back-end code (Java) with the front-end code (Angular) in an error-free, high-performance, and secure way.

It would be too easy, and too convenient, to place all the blame for the project's failure on Accenture. Hertz had no development team of its own that could have implemented the project in-house, because all developers had been laid off in early 2016. For the project to succeed, at least the role of "Product Owner", to borrow a term from the Scrum world, would have had to be filled in-house. Since this did not happen, the responsibility for which requirements went into the product backlog and in what order they were addressed rested not with the client but with the service provider. The resulting poor communication between Hertz and Accenture was probably one of the main reasons for the project's failure. And regardless of whether Hertz demanded it or not, Accenture committed to a go-live date. Instead of the planned Big Bang deployment, which never happened, it would have made far more sense to implement use case by use case in short iterations and to deploy continuously.

Learning from Failure

Modern development processes include retrospectives and post-mortems to help teams learn from their own mistakes. These practices help individual developers, teams, and entire organizations improve over time. It is commendable when companies publish their post-incident analyses, for example in a company blog, after something has gone wrong. This gives others the opportunity to learn from those mistakes as well.

Of course, we can also learn from things that went well. I enjoy watching videos of the "Classic Game Postmortem" talks from the Game Developers Conference. There I learn not only about games like Maniac Mansion or Civilization that I played on the Amiga as a child, or how programmers used clever tricks to work around the hardware limitations of the time, but also how software was developed back then, how projects were managed (or not), and so on. It always leaves me grateful that software is developed differently today.

The Chrysler C3 Project

The Chrysler Comprehensive Compensation System (C3 project) is an example of an IT project that produced something genuinely valuable. In the early 1990s, US carmaker Chrysler set out to develop new payroll software for 87,000 employees. Development in Smalltalk began in 1994; two years later, the software had still not processed a single payroll run, so Chrysler brought in Kent Beck to rescue the project. Beck in turn brought Ron Jeffries on board. In March 1996, the team estimated the software would be ready roughly a year later. The C3 project entered software development history because in 1997 the team decided to change how it worked; that new way of working later became known as Extreme Programming. The estimate proved nearly accurate: with only a few months' delay, caused by a handful of unclear requirements, a first version went live, handling monthly payroll for 10,000 employees. The practices that Kent Beck, Ron Jeffries, and their team employed (test-first programming, pair programming, close customer involvement, and above all short feedback cycles) were successfully proven on the C3 project and later formalized and popularized. They have permanently changed how we all develop software.

What Makes IT Projects Succeed

Developing software successfully means proceeding in a goal-oriented way. Those goals should flow from acceptance criteria agreed upon with the business. Without a clear definition of goals (in terms of tasks), developers risk losing themselves in their work; above all, they do not know when a task is done. Acceptance criteria can be documented and verified using automated tests. Either way, the goals must be defined before the production code is written. That is test-driven development, whether you choose to call it that or not.
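As a sketch of what this looks like in practice: an acceptance criterion written as an automated test before the production code exists. The domain and all names here are invented for illustration:

```python
# Acceptance criterion agreed with the business (hypothetical example):
# "Weekend rentals cost 10 percent less than the base price."

def weekend_discount(base_price: float, is_weekend: bool) -> float:
    """Production code, written only after the tests below defined 'done'."""
    return base_price * 0.9 if is_weekend else base_price


# The tests document the criterion and tell the developer when the task
# is finished.
def test_weekend_rental_is_ten_percent_cheaper():
    assert weekend_discount(100.0, is_weekend=True) == 90.0


def test_weekday_rental_costs_base_price():
    assert weekend_discount(100.0, is_weekend=False) == 100.0


test_weekend_rental_is_ten_percent_cheaper()
test_weekday_rental_costs_base_price()
print("acceptance criteria met")
```

The tests define "done" for the task; the production function exists only to make them pass.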

The primary task of a developer is not to write code, but to understand a problem. In his article about the MCAS software of the Boeing 737 MAX 8, Robert C. Martin puts it this way:

Programmers must not be treated as requirement robots. Rather, programmers must have intimate knowledge of the domain they are programming in.

A developer can only work meaningfully when they understand the domain in which the business they are developing software for operates. In my experience, IT projects work poorly when "the business" communicates with the "cost center IT" only through tickets, with no context as to why something should be changed or implemented. And IT projects work well, in my experience, when the software developers understand how the business works and how a change can contribute to its success. This is achievable when everyone involved recognizes that software development is more human-to-human communication than human-to-machine communication.