2. Apollo Guidance Computer

The Apollo Guidance Computer (AGC) was built in the 1960s for the Apollo moon landings. It was a relatively simple system by today’s standards, and its hardware, software, and UI were designed together. The technology looks primitive in retrospect, yet the system was resilient in the face of badly behaved external systems and occasional user error. It also made extremely good use of its hardware.

Introduction

The AGC is a system, meaning simply that it is made of interacting components. Among them are the computing hardware itself, the simple display and keypad (DSKY), the rendezvous radar system, and the astronauts. The users are often an important element of any system model.

I chose the AGC as a case study in building a resilient system, both for its successes and its shortcomings. This well-documented system could be easily studied for other reasons, but resilience is easier to study when a system is small and requirements are precisely known.

Note the following questions before you do the reading/viewing.

What are the components of the AGC and how do they interact?
What kinds of failures did the software team anticipate?
How did the system recover from failures?
Stepping back from the AGC to consider systems in general, how do we measure resilience? What does it mean for a system to be reliable?
With so little available memory, the AGC software team faced a major challenge. Duplicate blocks of code had to be avoided to save space. That suggests turning commonly needed blocks into functions and implementing a function call instruction. But memory for a stack was scarce, and registers were dedicated to specific uses. How did the team conserve ROM (where code was stored), RAM (working memory), and registers?
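On the measurement question above, two textbook starting points are worth having in hand before the reading. They are general reliability-engineering conventions, not figures or methods taken from the AGC program. Steady-state availability is commonly estimated as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. A system that needs every one of its independent components to work has a reliability equal to the product of the component reliabilities. The function names and all numbers in this sketch are hypothetical.

```python
from math import prod

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series_reliability(component_reliabilities: list[float]) -> float:
    """Reliability of a system that fails if any one independent component fails."""
    return prod(component_reliabilities)

# Hypothetical numbers, chosen only to illustrate the arithmetic.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)   # 0.999
r = series_reliability([0.99, 0.99, 0.99])           # about 0.970
```

Three components that are each 99% reliable give a system that is only about 97% reliable, which is one reason the number of components matters when reasoning about a whole system.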

We pause to note that Margaret Hamilton, who led the flight software development team for the Apollo program, coined the term software engineering. She observed that the design and implementation of software should be seen as a distinct process from hardware engineering. This may seem obvious today, when almost all hardware is “commercial, off the shelf” (COTS). Earlier, building a computer system meant engineering the hardware. That engineering was held in high regard as a process requiring education, skill, experience, and rigor, while writing the software was considered a secondary activity.

A Complex System

First, observe that even in a simple system, there is a lot to explore. As soon as we have multiple system components (including humans) interacting with each other, we have a complex dynamic system.

By dynamic, we mean that the state of the system changes over time. The term complex is used here informally to mean that the system behavior over time is not easily described. By contrast, a dynamic system in a steady state may exhibit ongoing changes while maintaining a form of equilibrium. With respect to discrete states, we may observe cycling through a sequence of states when a system is in a steady state.

The term complex usually suggests non-linearity, which describes a system whose output does not change proportionally to its input. Most interesting systems are non-linear, and their behavior may appear chaotic. As you would expect, such systems are hard to model. Engineers and scientists often resort to linear approximations.
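The proportionality idea can be checked numerically. This is a minimal sketch using two made-up functions (`linear` and `quadratic` are illustrative only, not models of any AGC subsystem). Doubling the input doubles the output of the first but quadruples the output of the second. The helper `linearize`, also hypothetical, builds the kind of linear approximation engineers fall back on, which is only trustworthy near the chosen operating point.

```python
def linear(x: float) -> float:
    return 3.0 * x

def quadratic(x: float) -> float:
    return x * x

def is_proportional(f, x: float, k: float, tol: float = 1e-9) -> bool:
    """True if scaling the input by k scales the output by k at this x."""
    return abs(f(k * x) - k * f(x)) <= tol

def linearize(f, x0: float, h: float = 1e-6):
    """Tangent-line approximation of f around x0 (central-difference slope)."""
    slope = (f(x0 + h) - f(x0 - h)) / (2 * h)
    return lambda x: f(x0) + slope * (x - x0)

print(is_proportional(linear, 2.0, 2.0))     # True
print(is_proportional(quadratic, 2.0, 2.0))  # False

approx = linearize(quadratic, 1.0)
print(abs(approx(1.1) - quadratic(1.1)))  # small error close to x0
print(abs(approx(3.0) - quadratic(3.0)))  # much larger error far from x0
```

The approximation error grows quickly away from the operating point, which is why linear models of non-linear systems need a stated range of validity.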

We will explore the topics of system dynamics and behavior later in the term. For now, we merely note that the overall behavior of a system as simple as the AGC may be quite complex.

Required reading/viewing

The videos are quite high-level, so you are likely to have many questions. This is good! Your questions will fuel our discussion.

Interview with Margaret Hamilton (6.5 mins; Hamilton was the lead software engineer)
Apollo 11’s “1202 Alarm” Explained (7 mins; exposition on a certain feature of the AGC)
The Real Story Behind the Apollo 11 Computer Error | WSJ (7 mins; includes interview with a lead programmer who worked for Hamilton)
Apollo 11’s “1202 Alarm” Explained (Archived article from Discover Magazine)

Additional resources (optional)

AGC on Wikipedia
What the Errors Tell Us (Interview with Hamilton; IEEE Software, Volume 35, Issue 5, September/October 2018.)

Investigation questions

DESIGN GOALS. In software engineering, we make a distinction between requirements and constraints, though sometimes they blur together. What were the requirements/constraints that shaped the design goals for the AGC?
AGC COMPONENTS.
a. What are the components of the AGC system? Consider the CPU and its memory as the single central component. To what other items or modules is the AGC connected?
b. What is “core memory”?
c. How many words of memory were in the AGC? What was the word size in bits?
d. How much power did it consume?
e. How big was the AGC physically?
f. Any other notable aspects of the system?
ALARMS. When the AGC issued “alarm codes” 1201 and 1202, it was (famously) an unexpected and frightening event.
a. What did each code say about what the AGC was doing? What was the root cause of these alarms, and how were the alarms resolved?
b. Some external systems provide data to the AGC by causing interrupts. How do interrupts work?
c. Suppose the AGC must periodically check the status of some external component (without receiving an interrupt). How does the AGC execute periodic tasks?
USER INTERFACE. The AGC had a UI for the astronauts to use.
a. How do the astronauts provide input to the AGC? Give some examples.
b. How did the AGC display information to the astronauts? Give some examples, apart from the alarm codes 1201 and 1202.
c. Speculate on why the input/output devices were custom-designed and so very limited.
ERROR RECOVERY. The AGC is an early example of a software-based system designed to be reliable.
a. What was considered to be an AGC error?
b. What kinds of errors were recoverable?
c. How did the AGC recover from 1201 and 1202 errors?
d. What is a good definition of system reliability? Is this the same as availability? Performance?
SYSTEM RELIABILITY.
a. What is a good definition of reliability? What about system reliability?
b. What are some measures of reliability? How does the reliability of a component relate to the reliability of a system?
c. What is availability? How is it measured?
d. Why might we want to understand how long it takes to repair a system, to get it running again? Is the mean time to repair a useful measure?
CONTROL PROGRAM. The AGC had a single CPU that was programmed in assembly language. Yet it had to handle multiple tasks, seemingly at the same time.
a. How do interrupts and timers make it possible for a single CPU to appear to be doing multiple things at once?
b. There were several “routines” that needed to be performed within multiple different AGC “programs”. The developers devised a technique that today we might call a bytecode interpreter. What is bytecode (or “high-level instructions”)?
c. Compare and contrast the AGC “bytecode” interpreter with other designs that you are familiar with, perhaps the Java JVM or the Python interpreter.
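
To make the bytecode idea in the CONTROL PROGRAM questions concrete, here is a minimal sketch of an interpreter loop. It is not the AGC's interpretive language: the opcodes (`PUSH`, `ADD`, `MUL`, `HALT`), the stack-based design, and the function name `run` are all invented for illustration. The point is only that a program becomes compact data, and one shared loop supplies the behavior behind each high-level instruction.

```python
def run(program: list[tuple]) -> float:
    """Execute a list of (opcode, optional operand) tuples on an operand stack."""
    stack: list[float] = []
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        op = program[pc][0]
        if op == "PUSH":
            stack.append(program[pc][1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return stack.pop()

# (2 + 3) * 4 written as hypothetical bytecode
program = [("PUSH", 2.0), ("PUSH", 3.0), ("ADD",), ("PUSH", 4.0), ("MUL",), ("HALT",)]
result = run(program)  # 20.0
```

Each tuple stands in for what could be many native instructions, so the arithmetic logic is stored once in the loop rather than repeated in every program. Comparing this loop with the evaluation loops of the JVM or the Python interpreter is one possible way into question c.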

Software Systems Design (FIRST DRAFT)