Making Computers do the Right Thing

There are two broad schemes for programming computer systems, which I will characterize here as the Math and Engineering approaches. A Math design endeavors to get every last detail correct and expects the computer to perform a large number of operations without error. An Engineering design plans to tolerate errors and accounts for failure modes, formally or informally.

The Math approach has been surprisingly successful in some astoundingly large tasks, at least by the standards of a few decades ago. Calculating pi to many billions of digits has required about 10^17 operations without error. The results have been checked. Similar, more useful calculations have successfully adopted the Math approach.

The claim about 10^17 operations requires qualification here. This site recounts in detail the saga of failing CPUs and disks and the extra code needed to detect those failures. The claim is not that all 10^17 operations were executed without error, but that the checked calculation could not have succeeded unless the code expressing the algorithm was correct. There is no explanation for the success of this calculation other than that the code was right and that the concatenation of good runs indeed amounted to 10^17 consecutive correct operations. The engineering described there to tolerate and overcome these errors is worth study.
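
To make the checking style concrete, here is a minimal Python sketch: run a deterministic computation in chunks, execute each chunk twice from the same checkpoint, and accept a chunk only when the two runs agree. This is only an illustration of the discipline, not the actual machinery of the pi runs described on the site above; the names (checked_run, replay) are hypothetical, and the sketch assumes a pure step function over an immutable state.

    import hashlib

    def digest(state):
        # Fingerprint of the state; disagreement signals a corrupted run.
        return hashlib.sha256(repr(state).encode()).hexdigest()

    def replay(step, state, k):
        # Apply a pure step function k times to an immutable state.
        for _ in range(k):
            state = step(state)
        return state

    def checked_run(step, state, n_steps, chunk=10_000):
        # Execute each chunk twice from the same checkpoint.
        done = 0
        while done < n_steps:
            k = min(chunk, n_steps - done)
            first = replay(step, state, k)
            second = replay(step, state, k)
            if digest(first) != digest(second):
                continue      # one run was corrupted; redo the chunk
            state, done = first, done + k
        return state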

People are not surprised when Windows or Mac OS crashes, even when there is evidence that the hardware is not to blame. The builders of these systems shipped them when they deemed the level of problems acceptable. That is an Engineering tradeoff. I think that both of these systems suffer from two different kinds of fault: design flaws and implementation flaws. An implementation flaw is one where a small editing change to the source will prevent crashes. A design flaw is one where no algorithm much like the present one avoids the failure. The prime example that I am aware of is the unprotected boundary between the stack and the heap in the Mac OS. Arbitrarily obscure mysteries arise when the user commits the sin of merely allocating too little space to the application, or asks the application to do too big a job.

Some designs must be Engineering designs, for their very problem statement involves tradeoffs; programs to fly an airplane are an example. In such cases a late answer may well be worse than a wrong answer.
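
A minimal Python sketch of that tradeoff: give the computation a fixed time budget and ship whatever estimate is in hand when the budget expires. The 20 ms deadline and the refine function are assumptions for illustration, not anyone's actual flight code.

    import time

    DEADLINE = 0.020   # hypothetical 20 ms control-loop budget

    def timely_answer(refine, estimate):
        # Improve the estimate until the deadline; a coarse answer on
        # time beats a precise answer that arrives late.
        start = time.monotonic()
        while time.monotonic() - start < DEADLINE:
            estimate = refine(estimate)
        return estimate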

Middle Ground

It is interesting to examine how these two schemes meet.

I know of no program that can tolerate even a 0.0001 error rate in conditional branch decisions. I suspect that even the flakiest of large programs requires an error rate below 10^-8. No successful programmer is innocent of the Math viewpoint.
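
The arithmetic behind that claim is stark. If each branch decision independently goes wrong with probability p, a run of n branches succeeds with probability (1 - p)^n:

    # At p = 1e-4, even a million branch decisions almost surely fail;
    # at p = 1e-8, the same run succeeds about 99% of the time.
    p, n = 1e-4, 1_000_000
    print((1 - p) ** n)     # ~3.7e-44: effectively zero
    print((1 - 1e-8) ** n)  # ~0.99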

Most crypto algorithms that are shipped are error free. (The entire crypto suite is another issue, as is the correctness of crypto protocols.) There is one “leap of faith”: the belief that there is no algorithm that will decrypt the data without the key in a “reasonable” time.
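
One reason shipped crypto code reaches that standard is mechanical checking against published test vectors. A small Python example using the standard library and the known FIPS 180-2 SHA-256 vector; SHA-256 is a stand-in here, since the text names no particular algorithm:

    import hashlib

    # FIPS 180-2 test vector: the published SHA-256 digest of "abc".
    expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    assert hashlib.sha256(b"abc").hexdigest() == expected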

Much of the Keykos design is in the Math style.

Perfection from Imperfection

Von Neumann showed in the early 1950s that it was possible to build reliable digital hardware from unreliable parts. Error-control disciplines have shown these ideas to be very successful indeed. Perhaps the extreme of this art is reached when noisy data channels from space probes, with raw error rates approaching 50%, nonetheless deliver millions or billions of bits without error.
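
The simplest instance of the idea is a repetition code decoded by majority vote. Von Neumann's actual construction multiplexed majority organs, and real space links use far stronger codes, so the Python sketch below conveys only the flavor of the technique:

    import random

    def transmit(bits, flip_p, copies=5):
        # Send each bit `copies` times over a channel that flips bits
        # with probability flip_p; decode by majority vote.
        decoded = []
        for b in bits:
            received = [b ^ (random.random() < flip_p) for _ in range(copies)]
            decoded.append(int(sum(received) > copies // 2))
        return decoded

    random.seed(0)
    msg = [random.randint(0, 1) for _ in range(10_000)]
    out = transmit(msg, flip_p=0.1)
    print(sum(m != o for m, o in zip(msg, out)))  # roughly 86 residual errors

With five copies and a 10% flip rate, a bit is decoded wrongly only when three or more copies flip, a probability of about 0.86%; more copies, or a smarter code, drive the residual rate down further.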