Computer error: reliable digital processing

Valves give way to transistors

Various ingenious ways of operating were developed to ensure that the machine would work for a sufficient interval to achieve useful results. The LEO I computer of 1951 demonstrated that computers could move out of secret military establishments and universities into the commercial world, giving rise to the first business machines. However, the dubious reliability and large physical size of these early machines, not to mention their enormous power requirements, kept them firmly fixed in large, well-ventilated rooms.

In 1953, the world’s first computer using transistors became operational at Manchester University. It used 92 point-contact transistors, 550 diodes and a few valves in the clock circuits. A Mark 2 machine started work in 1955 with 200 transistors. Unfortunately, first-generation transistors were no more reliable than valves and the computer could only manage an MTBF (mean time between failures) of 1.5 hours. Nevertheless, the new computers consumed far less power and took up less space than their predecessors.

It took a while for the relatively new transistor technology to be accepted. In 1956, Ferranti launched their Pegasus computer (Fig.1, an image from contemporary literature), designed for high-speed computation and still full of valves. The approach taken here was to reduce the MTTR (mean time to repair) by introducing modularity. This appears to violate Rule 1 of fault avoidance (no redundancy) by increasing the number of components, particularly connectors. However, the variety of modules was kept to a minimum and, once a fault had been traced, replacing the faulty module was very quick. Fig.2 shows a Pegasus module, complete with a convenient handle to pull it out. Note also that the valves are placed so that a blown heater filament can be quickly spotted.
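Why does a shorter repair time matter as much as fewer failures? A common textbook measure, steady-state availability, combines the two figures as MTBF / (MTBF + MTTR). The Python sketch below is illustrative only: the function name and the hour values are assumptions chosen for the example, not measured figures for Pegasus or any other machine.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable machine is usable (steady state)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the failure rate is unchanged, but repair time is cut
# from 4 hours (component-level fault tracing) to 0.5 hours (module swap).
slow_repair = availability(mtbf_hours=20.0, mttr_hours=4.0)
fast_repair = availability(mtbf_hours=20.0, mttr_hours=0.5)
print(f"slow repair: {slow_repair:.1%}, fast repair: {fast_repair:.1%}")
```

With these made-up numbers, availability rises from roughly 83% to roughly 98% without any improvement in the components themselves, which is the argument for modular, quickly swapped units.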

The era of valves was inevitably limited, especially with the arrival of junction transistors, which were a vast improvement both in performance and reliability compared to early fragile point-contact designs. The all-transistor LEO III appeared in 1961 with the same modular structure as Pegasus (Fig.3).

A further advantage of transistor-based circuits was that they were much safer to work on, thanks to the lower operating voltages required by solid-state devices (transistors and diodes). Fault finding on a valve-based computer required great care when prodding about with a ‘scope probe on live circuit boards. A useful tip, presumably learnt the ‘hard way’, was for technicians to keep one hand behind their back and hold the probe in the other, to avoid inadvertently placing several hundred volts across their heart.

A key feature of early computers was that they were used in non-safety-critical applications, so as long as they kept going during the working day and faults could be quickly traced and fixed by the next day, then their reliability was considered acceptable. However, once computers started shrinking in size and power consumption, thanks to solid-state technology, mobile applications became possible. The snag was that reliable operation now meant safe operation and the question was soon asked… ‘Would you trust your life to a computer?’

Computers into space

The race to be the first to land a man on the moon in the 1960s led to rapid developments in a whole range of technologies, the most obvious and far reaching being solid-state electronics. The real-time calculations required to operate a vehicle like the Saturn V rocket, or navigate to a point on the moon hundreds of thousands of miles away, required fast, powerful and reliable computers — and those computers had to fly on board the space vehicle.

NASA’s Apollo Guidance Computer (AGC) was the result, and was ready by 1966. Unfortunately, it became known to the general public in 1969, not because of its amazing performance, but because it seemed to get overloaded at a critical moment, and Neil Armstrong had to take over minutes before landing on the moon. It had performed faultlessly up to this point and only got into trouble when a switch was left in the wrong position. The machine itself consisted of two parts: the main electronics box and the Display-Keyboard (DSKY) on the control panel (Fig.4). The lander and the command module each carried one of these incredible machines.

The main electronics box was made small enough and light enough to be carried thanks to the first integrated circuits: the manned flights used an AGC based on 2800 Resistor-Transistor Logic (RTL) dual 3-input NOR gates (Fig.5). The use of only one type of technology yielded a much more reliable system than others based on mixed Diode-Transistor Logic (DTL) and diode logic. The surface-mount devices in the picture look remarkably modern, but each only contains six transistors and a few resistors.
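Building a whole computer from one gate type is possible because NOR is functionally complete: any other logic function can be wired up from NOR gates alone. The Python sketch below shows this at the truth-table level. The function names are made up for the illustration, and it makes no claim about how the AGC's circuits were actually arranged.

```python
def nor3(a: bool, b: bool, c: bool = False) -> bool:
    """3-input NOR: the output is high only when every input is low."""
    return not (a or b or c)


def not_gate(a: bool) -> bool:
    # Unused inputs are tied low (False).
    return nor3(a, False, False)


def or_gate(a: bool, b: bool) -> bool:
    # A NOR followed by an inverter.
    return not_gate(nor3(a, b))


def and_gate(a: bool, b: bool) -> bool:
    # De Morgan: a AND b equals NOR(NOT a, NOT b).
    return nor3(not_gate(a), not_gate(b))


# Print the truth table for the derived gates.
for a in (False, True):
    for b in (False, True):
        print(a, b, "OR:", or_gate(a, b), "AND:", and_gate(a, b))
```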

Launch Vehicle Digital Computer

Before leaving the topic of the Apollo computers, there is an unsung hero that should be recognised: the Saturn V rocket Launch Vehicle Digital Computer (LVDC), shown with the Saturn V project director and designer Wernher von Braun in Fig.6. Most of the round-capped connectors on each side are electrical, but some are for cooling water hoses. The real-time control operations involved in driving a vehicle to the moon are far too complex for the human brain to deal with, so the Saturn V rocket had a digital autopilot. In essence, an on-board computer was loaded with 3-dimensional coordinates for a point in orbit prior to launch, a ‘Go’ button was pressed and the astronauts relaxed while the LVDC piloted them into space.

The LVDC system was designed so that the mission commander kept a hand on an abort handle in case something went wrong, which it did with Apollo 12. Seconds after lift-off, all telemetry from the rocket went haywire. The astronauts and the support engineers didn’t know it at the time, but the rocket had been struck by lightning, which threw half the spacecraft electronics into a spasm. Fortunately, one of the flight controllers at Mission Control thought he recognised the symptoms and suggested to the astronauts that they just needed to turn an obscure switch to ‘AUX’. It worked, and telemetry was restored. Meanwhile, the rocket had just kept on going on its preset course. The ability of the LVDC to cope with this emergency was probably down to its extremely advanced architecture, involving ‘triple modular redundancy’ (TMR) with voting logic, a concept which will be discussed in detail in Part 3 as we move from fault avoidance to fault-tolerant design.
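As a taste of the voting idea, here is a minimal Python sketch of a bitwise majority voter for three redundant channels. It is a generic illustration of TMR, not the LVDC's actual logic. The function names and the test values are assumptions made for the example.

```python
from typing import List


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def dissenting_channels(a: int, b: int, c: int) -> List[str]:
    """Name the channels whose word differs from the voted result."""
    voted = majority_vote(a, b, c)
    channels = (("A", a), ("B", b), ("C", c))
    return [name for name, value in channels if value != voted]


# Hypothetical case: channel B suffers a single flipped bit.
good = 0b10110010
faulty = good ^ 0b00001000
print(bin(majority_vote(good, faulty, good)))   # the voted word
print(dissenting_channels(good, faulty, good))  # channels that disagreed
```

A single faulty channel is simply outvoted, so the output stays correct while the disagreement can be logged for later repair.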

The Apollo moon-landing project proved that computer technology was now reliable enough to perform real-time tasks in situations where failure could lead to loss of life. Initially, the people trusting their lives were astronauts and military pilots, who were prepared to accept the risks involved. After all, given their professions, the added risk of computer failure probably seemed small alongside all the very obvious physical hazards.

Failure is not an option

Civilians, on the other hand, do not wish to factor in the likelihood of computer error leading to a fatal outcome when making their holiday flight plans. The Airbus A320 airliner first flew in 1987 and was the first civilian airliner with digital ‘fly-by-wire’ flight controls. From that point, the design principle has been that the computer must not fail, or if it does, the failure must be predictable and benign. This means the machine must detect faults itself and deal with them. The only way to do this is through the use of redundant components: two or more processors all run the same program (but not necessarily the same code), and their outputs are compared before control actions are taken. These computer architectures not only improve safety, they also improve reliability in applications such as space probes, where repair is simply never an option. Some of the latest multi-core microcontroller devices developed for the automotive and medical equipment markets have this redundancy built in. Next month, we conclude this series with an examination of the latest thinking in computer reliability design philosophy.
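The ‘same program, different code’ comparison described above can be sketched in a few lines of Python. Two independently written routines implement one specification, and the result is passed on only when they agree. Everything here is hypothetical: the limit value, the function names and the simulated fault are assumptions for illustration, not taken from any real flight control system.

```python
from typing import Callable, Optional

LIMIT = 25  # hypothetical control-surface limit, in degrees


def clamp_a(demand: int) -> int:
    """Channel A: limit the demand using min/max."""
    return max(-LIMIT, min(LIMIT, demand))


def clamp_b(demand: int) -> int:
    """Channel B: the same specification, written independently."""
    if demand > LIMIT:
        return LIMIT
    if demand < -LIMIT:
        return -LIMIT
    return demand


def checked_output(demand: int,
                   chan_a: Callable[[int], int] = clamp_a,
                   chan_b: Callable[[int], int] = clamp_b) -> Optional[int]:
    """Return the command only if both channels agree, else None (fault)."""
    a, b = chan_a(demand), chan_b(demand)
    return a if a == b else None


def faulty_channel(demand: int) -> int:
    """Simulated fault: this channel ignores the limit."""
    return demand


print(checked_output(40))                         # both channels agree
print(checked_output(40, chan_b=faulty_channel))  # mismatch is flagged
```

Returning None stands in for the ‘predictable and benign’ failure: the caller is told that the channels disagree instead of being handed a possibly wrong command. With only two channels the fault is detected but not located, which is why voting schemes use three or more.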
