Physics before Floating Point

Long ago when computers were mostly one of a kind, and some even lacked a subroutine facility, a few computers had some sort of floating point. A mechanical computer built by Konrad Zuse in about 1940 had 24 bit binary floating point with several ideas that resurfaced in IEEE floating point.

John von Neumann was highly influential in most subsequent computer design and von Neumann thought that floating point was economically and numerically inferior to scaled fixed point. A generation or two of machines appeared without floating point in about 1953. These generations were the first where several machines of the same design were produced. Complex physics programs already were appearing and provided much motivation for these new machines.

This note is about the practice of scaled fixed point approximations to real numbers. This goes as well for other sciences but I think physics was the pioneer user of the early computers. The IBM 701 took after the Princeton machine in many of these details, as did the ERA 1103. The Princeton machine was 40 bits. The 701 and 1103 were 36 bit machines for each of which multiple identical copies were manufactured.

The IBM 701 had two 36 bit programmable registers, the AC and the MQ. (Excluding the program counter.) The machine was sign and magnitude; the left most bit was 1 for negative numbers and inverting that bit of the representation of x yielded the representation of −x.

Add and subtract were like modern two’s complement machines unless negative numbers were involved. The multiply instruction left the 72 bit product of memory and the MQ value, in the combined AC and MQ. A long shift instruction operated on the combined registers to shift the resulting double number to adjust for scaling and put the result in either the AC or MQ. The divide instruction left the quotient of AC-MQ in the MQ. The remainder was left in the AC. These shift commands did not change the sign.

Compilers were scarce and did not support these operations well. Coding practice was as if there were these C functions:

Or in more mathematical notation the functions are: In these expressions the arguments and function values are interpreted as signed integers with magnitude less than 235. The value for s is constant for a given expression and comes from the instruction stream. The shift amount was seldom modified during program execution. It was frequently modified during tuning to a particular application, or adaptation of code to another application. The multiply could be carried out by machine instruction such as: Note that this code does not check that the product fits in the one word field.

Divide could be done as:

The 701 would stop with the “divide check” indicator lamp on if the quotient didn’t fit in the MQ.

Global Scaling

Of course when doing physics these bit patterns are not interpreted as integers. Each variable is given its own scaling as a power of two; another level of interpretation. Each variable required a pre compile decision about scaling. Typically the decision was tantamount to assertions such as:
|x| < 29. The exponent of 2 might well be negative. It was common that |x| < 29 and |y| < 2−3 but also that |xy| < 24. In this case the last inequality does not follow from the first two. This is a common situation in physics calculations for the two factors are correlated and do not attain their max at the same time. Such assertions were often wishful thinking that could be violated by unexpected physical situations. Discovering easily that the assertion was violated required extra code that was not always included. When scaling decisions were violated it was necessary to modify all references to the variable. Scaling of subexpressions might also require application insight. This was a particular obstacle in designing a good language for these machines.

A testable one bit indicator called “overflow” would be turned on when an add or subtract result exceeded the range of fixed point values. A long left shift would also turn on this indicator(?). Occasional testing of overflow would alert that scaling assumptions had not been met.

This 1954 description of the Los Alamo Maniac computer includes a discussion of such matters.

Interpretive Floating Point

Speedcode was a 701 program that provided what might be called today a “virtual machine” with floating point and index registers. John Bakus, later of Fortran fame, pioneered Speedcode. I did not use Speedcode.

Modern Machines

Today’s machines do not support this sort of arithmetic very well. Floating point has displaced this niche in general purpose machines. Scaled arithmetic requires a product that takes one word from somewhere in the double word product. Scaled divide is likewise difficult. Today most machines make such calculations awkward. C and other languages do not express the task well either.

Digital Signal Processors retain some of this arithmetic flexibility but the few that I have looked at are not as general purpose as the 701 in this regard. Apple’s graphics software on the 68K (which lacked floating point) provided the subroutines m(x, y, 16) and f(x, y, 16) which ameliorated these problems.

Language Support

There were some ideas proposed for ADA in support of such practice. Here is what I would propose for a higher level language. Provide a parameterized scaled type effectively declaring that the variable would be some multiple of r, a rational, and within a definite range of values. The range part is like Pascal and perhaps one of the ADA proposals. If the programmer always chose r as a power of two then adds could be done with fixed shifts, as it was done in 1955. The sum and the product of two numbers of different such types, is of such a type. If the declared ranges were less than the memory words, then not all additions would require a compiled overflow test. Applications that need fixed decimal behavior are naturally and efficiently supported in binary formats. Special notation that I will not try to invent here would sometimes be necessary to scale the subexpressions.

Hardware Floating Point

Livermore had fairly complex numerical calculations and few there dealt with issues of precision. Empirically production programs ran better after translation to a platform with more precision. Until the 360 there were no common machines which provided more than one floating precision. The earliest binary computers adopted a 36 bit fixed point format that would be scaled by program logic—each variable and subexpression was scaled by some power of 2 by explicit program logic. When the 704 introduced floating point it was argued that the loss of 9 bits of precision that the exponent displaced would be made up for by providing more precision in those cases where compile time scaling was too pessimistic. The extreme convenience of floating point quashed all contrary arguments. Fixed point physics was dead! Several other systems adopted the 36 bit floating format, with variations. Burroughs and CDC computers adopted 48 bits which held a floating value more comfortably. Later the Stretch and the 6600 moved to 48 bits of floating precision within 64 and 60 bit words respectively. Numerical problems in the big production codes diminished with this new precision and there was little incentive to seek problem formulations requiring less precision.

The 360 defined what real number was represented by any particular combination of bits within a floating point value and baldly declared that add, subtract, multiply and divide would in each case produce that floating point value with the largest magnitude not greater than the mathematically exact value (round towards zero). Out of range exceptions were, of course, noted. The model 91(??) used a divide algorithm that might be one bit off in the last place. This was documented. It was a very fast divide.


These ideas can accommodate virtual decimal as well. Suppose some program specifies that it must handle decimal numerals with m digits to the left of the decimal point and n to the right. We call that a (m, n) numeral here. If x is an exact value that can be so represented then 10nx is an integer that can be stored in a binary field of ceil((log 10/log 2)(m+n)) bits. The binary field is signed just if the virtual decimal field is. With this convention a virtual add is a real add when the digits to the right of the decimal point are the same. Otherwise a multiplication by a power of 10 (expressed in binary) is necessary. Rounding the virtual decimal field, for significance shedding, amounts to binary rounding. To convert a (m, n) decimal (whose binary rep is B) to a (m, n') decimal B' where n' < n, perform the binary computation
B' = (B + 10n−n'/2)/10n−n'
rounding towards zero. Sometime the division can be replaced by a multiply.

Un Normalized values

Many of the machines provided unnormalized floating point add and subtract. Floating multiply normally provided only one bit of normalization which always sufficed for normalized inputs. Floating divide required normalized divisors in later machines and would also provide only one bit of normalization. The unnormalized operations were used mainly on logical data and in algorithms that were very sensitive to the floating format.

The CDC 6600 provided no normalized add or subtract, but provided a fast normalization command. Some of Livermore’s production codes initially omitted normalization by default, except as divisors. Tracking down the consequences of this turned out to be too expensive and most or all applications normalized all adds and subtracts. One particular problem was that the product of two numbers with j and k leading zeros in their respective mantissas, would have at least j+k−1 leading zeros in the mantissa of the product. This was worse than a pessimistic estimate of the precision.

The above is direct personal experience. The following is an indirect report of an attempt to do matrix arithmetic on significant matrices with arithmetic that carefully traced the precision of each intermediate operand to one bit as in the unnormalized adds. Special attention was paid to multiplication and division to preserve the precision estimates. Some matrix calculations that were giving useful results that agreed with observation, produced zero precision when the precision was tracked. I do not know whether they had super precision matrix arithmetic to checkout these calculations against the mathematically exact values.

I hear that interval arithmetic fared better in these cases. Still it seemed that interval arithmetic was in practice unduly pessimistic.

Floating History