Ariane 5 Flight 501 Failure

Notes and work on the systems engineering lessons from the Ariane 5 Flight 501 launch failure. Although the failure is often cited as an example of a programming bug or error, the Board of Inquiry report shows that the launch failed as a result of system-level design and engineering errors.

Orcmid started a web page on the topic at:

http://nfocentrale.net/orcmid/writings/W040400.htm

Update: 2004-06-14: A DeMarco/Lister reference to the Ariane 5 failure was found in Waltzing with Bears.

Bill started a letter to Dr Dobb's Journal in response to the May 2004 editorial:

Jonathan Erickson's May 2004 editorial ("Quit Buggin' Me!") is informative -- it's always good to be reminded of our continual and growing dependence on software-based systems. And references to sources of good debugging tools and practices are also welcome and needed.

However, using the Ariane 5 rocket launch failure as an example of a high-profile software bug is wrong and reinforces an "urban legend" of software engineering. The Enquiry Board Report on the Ariane 5 event clearly indicates that the launch failure was not caused by a software bug, but rather by a failure of systems engineering, including design, deployment, and testing. [Reference to report of enquiry board]

Erickson defines a "bug [as] any fault in a program that causes the program to do something different than what was intended." It is true that there was a software overflow resulting from division of a 16-bit .... However, the report clearly shows that the software module was knowingly coded to throw an unchecked exception in the case of an arithmetic overflow error. By Erickson's definition this behavior does not reveal a fault -- the module performed as designed.
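To make the "performed as designed" point concrete, here is a minimal sketch in Python of how the SRI's down-conversion behaved. The actual module was written in Ada, whose range-checked conversions raise Constraint_Error on overflow; the names, structure, and values below are illustrative, not from the flight code.

```python
# Hypothetical sketch of the SRI down-conversion, mimicking Ada's
# range-checked conversion: an out-of-range value raises rather than
# silently wrapping. Names here are illustrative only.

INT16_MIN, INT16_MAX = -32768, 32767

class OperandError(Exception):
    """Stands in for the Operand Error raised by the Ada runtime."""

def to_int16(horizontal_bias: float) -> int:
    """Convert a 64-bit float to a 16-bit integer, raising on overflow.

    The exception is deliberately left unhandled by the caller: under
    the Ariane 4 design assumptions, an out-of-range value could only
    mean that the hardware was failing.
    """
    value = int(horizontal_bias)
    if value < INT16_MIN or value > INT16_MAX:
        raise OperandError(f"value {value} outside 16-bit range")
    return value
```

Within the Ariane 4 trajectory envelope the check could never fire; on Ariane 5, larger horizontal bias values did exceed the range, and the module raised exactly as designed.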

The Ariane 5 flight 501 termination was the result of a systems engineering assumption; viz., any exception thrown by the inertial guidance software must indicate a hardware failure. In other words, the software was assumed to be correct -- an assumption that is never really warranted in software engineering.

--

Another reference to the Ariane 5 event is contained in the following blog entry criticizing a recent IEEE Computer article.

http://www.cincomsmalltalk.com/blog/blogView?showComments=true&entry=3266964593

Excerpt from software fault tolerance course outline that mentions the Ariane 5:

Software Fault Tolerance – (based on slides by Jörg Kienzle), Slide 1: Software Fault Tolerance Overview. Prof. Dr. Christof Fetzer, Systems Engineering Group, TU Dresden. http://wwwse.inf.tu-dresden.de

Software Fault Tolerance – (based on slides by Jörg Kienzle) Slide 8 Ariane V, June 4th 1996

• IRS raises an Operand Error while converting a 64-bit float to a 16-bit integer

• Operand Error caused by high values of Horizontal Bias, which are normal for Ariane V

• Function serves no purpose after lift-off in Ariane V; Ariane IV, however, needs it for 50 seconds

• Not possible to switch to backup IRS, for it had failed as well (72 ms earlier)

• On-board computer interprets "core dump" data as normal flight data

• Full nozzle deflection of the solid boosters and the Vulcain engine

• Angle of attack > 20 degrees

• Separation of boosters from main stage

• Self-destruction after 39 seconds

--

Update: 2004-08-20

A couple of email notes from orcmid on references to the Ariane story:

Note 1:

I found two more articles about Ariane 501. The first is in Sommerville's on-line supplemental materials to the 7th edition of his software engineering book: . There are slides and also a paper that analyzes the problem in terms of programming-language silver bullets.

Note 2:

I forgot to say where the second occurrence of Ariane 501 was. Here's the Bloglet that I captured on the topic:

ACM News Service: Controlling Software Component Quality.  The QCCS project raises the Ariane 501 example as a contrast with the sort of contract that can avoid calamitous errors. QCCS is offered as a way to deal with underspecification of interfaces and to employ AOP as well. This will be great for a reference from the Ariane 501 work and also something to dig into around abstractions and modeling.

The IST release on Quality Controlled Component-based Software  (QCCS) provides links to the project pages and related information.  It also correctly identifies the underspecified behavior of the Ariane 4 module that could have been noticed on review for Ariane 5, if the specification had been complete enough for the failure case that was observed.
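A hedged sketch of the kind of explicit contract this points toward: if the Ariane 4 range assumption had been written as a checked precondition on the module's interface, a reuse review for Ariane 5 would have had something concrete to flag. This is illustrative Python, not QCCS code; the constant and names are invented for the example.

```python
# Illustrative design-by-contract sketch (not from QCCS or the flight
# software): the reuse assumption is stated as an explicit, checkable
# precondition instead of living silently in the surrounding design.

MAX_ABS_HORIZONTAL_BIAS = 32767.0  # assumed Ariane 4 trajectory envelope

def convert_bias(bias: float) -> int:
    """Precondition: |bias| lies within the Ariane 4 envelope.

    A reviewer porting this module to a new launcher must show that the
    new trajectory satisfies the precondition, or renegotiate the
    contract -- the assumption is visible at the interface.
    """
    assert abs(bias) <= MAX_ABS_HORIZONTAL_BIAS, (
        "contract violated: bias outside the Ariane 4 envelope")
    return int(bias)
```

The point is not the assertion mechanism but that the assumption becomes part of the specification a reviewer sees.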

--- Update: 2005-02-07

Personal statement from RP Feynman regarding the official report on the Challenger shuttle failure.

http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt

This is a terrific explanation of the cons of top-down system construction and testing. It makes me even more convinced that agile approaches to software development, which do bottom-up composition of unit-tested components, are one good path to improving software system quality.

Update: 2005-12-22

Here's another mistaken reference to the Ariane 5 problem as a software bug.

http://www.wired.com/news/technology/bugs/0,2924,69355-2,00.html?tw=wn_story_page_next1

And even worse than the mistake is that the article is referenced in ACM TechNews.

End of year e-mail from Dennis:

From: "Dennis E. Hamilton"
To: "William L. Anderson"
Subject: Ariane 501 Again Still
Date: Fri, 30 Dec 2005 06:59:24 -0800
Organization: NuovoDoc

Vicki got up at 3am to take a friend to an early flight, I woke up at 3:30 as she was leaving and I was rested enough that I couldn't go back to sleep. Ariane 501 has come up again. This time as one of the "ten great software bugs of history." And from Simson Garfinkel, which is really disappointing:

So I was aroused enough to start making notes on a narrative to finally debunk this thing.

I BlogJetted a couple of paragraphs, then started looking for my original notes (from April 2004, the last time I started to document this on my site) and outlining how to address this. I have four pieces:

A. What Happened - a time sequence of events from the design of the Inertial Reference System (SRI) for Ariane 4, through the run-up to Ariane 501 and the flight simulation that assumed the SRI was good, then the events of the flight, the findings of the Board of Inquiry, and the aftermath. This is a document by itself.

B. What We Made It Mean. All of the speculative nostrums that are claimed to be preventatives, what smart people made of it and how that did or did not suit the facts, and the creation and promulgation of a folk-legend about an overflow crashing a computer that destroyed a launcher in flight.

C. Getting the Lesson - the real lesson that every student in a real-time embedded software engineering class should be able to recite like a marksman disassembling and reassembling his weapon. Then the IT version (since there are lots of present-day practical systems lessons here, and one has to do with the difference between competence and performance).

D. What is the Lesson We Want It to Be, as opposed to what it is. Why is it important for this to be seen as a bug? Maybe because we don't want to believe in the importance of process, and it was a process failure? The Board of Inquiry had no illusions about that. Why do we?

Anyhow, I am recording what I can about A, outline style.

Sheesh.

---

UPDATE: the Ariane 501 failure once again made the annual list of major software bugs. Dennis Hamilton posted a reply to the Risks-Digest that shows that it was not a software bug that caused the termination of Ariane 5 Flight 501. We continue to refuse to pay attention to the documented history. - Bill A

Sent: Friday, September 17, 2010 18:47
To: 'risks@csl.sri.com'
Subject: Ariane 501: Not that Kind of Bug

I find it distressing that the loss of Ariane flight 501 is repeatedly taken as demonstration of a software bug, as in the list of (now 11) software bugs cited in Risks Digest 26.16. I think this covers over a far more important lesson and trivializes the incident. See .

Here's another way to look at this. If this was a programming error, what is it that the programmer could have done, for the Ariane 4 Inertial Reference System as actually designed and constrained, such that the Ariane 501 mission would not have been lost?

Consider that

(1) The developers knew that a down-conversion to 16 bits could result in an overflow, so it wasn't a lack of numerical analysis. Under the parameters of the design, the apparatus would never see conditions in which the value produced would be out of the 16-bit range.

(2) The developers knew that a down-conversion to 16 bits could result in an overflow, so a proof of correctness would not help. The correctness proof, if undertaken, would confirm that the values outside of the range for the design conditions would not occur, based on external facts (physical circumstances) that would have to be accepted as hypotheses supporting the proof.

(3) The developers (using Ada, which has provisions for catching an out-of-range attempt for down-conversion), intentionally allowed an uncaught exception to be thrown under the agreed conditions that the exceptional values could only happen if the hardware was failing.

You can surmise that (3) turned out to be a problem, even though it was correct that, under the conditions of Inertial Reference Platform usage, including for Ariane 501, there would be and there was no down-conversion failure.

The problem was that the Inertial Reference Platform was left running after launch, when its output is not only not needed but useless; that the unit and its backup both shut down (same problem, not equipment failure); and that, because of the shutdowns, the guidance system was fed a diagnostic message instead of guidance data, which, for some reason, it was not designed to recognize.
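The last point is an interface underspecification: nothing on the channel distinguished a diagnostic dump from flight data. A minimal sketch, with an invented message format (the real bus protocol is not reproduced here), of tagging frames so the consumer can refuse the wrong kind:

```python
# Hypothetical message framing (illustrative only, not the actual SRI
# bus format): each frame carries a type tag so the flight computer can
# reject diagnostic frames instead of steering on their contents.

from dataclasses import dataclass

FLIGHT_DATA, DIAGNOSTIC = 0, 1

@dataclass
class Frame:
    kind: int       # FLIGHT_DATA or DIAGNOSTIC
    payload: bytes  # attitude data, or a diagnostic dump

def guidance_input(frame: Frame) -> bytes:
    """Return guidance payload, refusing diagnostic frames outright."""
    if frame.kind != FLIGHT_DATA:
        raise ValueError("diagnostic frame received on guidance channel")
    return frame.payload
```

On Flight 501 no such discrimination existed, so the on-board computer interpreted the diagnostic dump as attitude data and commanded the full nozzle deflection.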

So there is nothing different that was available for the software developers to have done on their own. The failures of system engineering, on the other hand, are quite educational.