Engineering Considerations of the LARC

Norman Hardy. Last updated June 1998.

I arrived at Livermore in 1955 as a programmer, and an early assignment was to visit Philadelphia to learn about the logic design of the as yet undelivered LARC. It was a fascinating experience.

The machine was decimal. The word held 12 digits. An instruction was formatted as TIIAABBMMMMM. II was the op-code. A register field in the instruction was two digits but our machine had just 26 general registers. AA selected the register operand and the contents of register BB were added to MMMMM to form the effective address. T might indicate a normal instruction, a traced instruction, or indirect addressing.
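The field layout above can be sketched in a few lines of Python. This is only an illustration: the names Instruction, decode and effective_address are hypothetical, the encoding of the T digit is not given here and is left as a raw digit, and wrapping the effective address to five digits is an assumption.

```python
from typing import NamedTuple, Sequence


class Instruction(NamedTuple):
    t: int   # T: normal, traced, or indirect (exact encoding not given in the text)
    op: int  # II: op-code
    a: int   # AA: selects the register operand
    b: int   # BB: register whose contents are added to MMMMM
    m: int   # MMMMM: address field


def decode(word: str) -> Instruction:
    """Split a 12-digit decimal word laid out as TIIAABBMMMMM."""
    if len(word) != 12 or not (word.isascii() and word.isdigit()):
        raise ValueError("expected exactly 12 decimal digits")
    return Instruction(int(word[0]), int(word[1:3]), int(word[3:5]),
                       int(word[5:7]), int(word[7:12]))


def effective_address(ins: Instruction, registers: Sequence[int]) -> int:
    """Add the contents of register BB to MMMMM."""
    # Assumption: the sum wraps to five decimal digits; the text does not say.
    return (registers[ins.b] + ins.m) % 100_000
```

For example, decoding the word 012050300100 gives op-code 12, register operand 05, index register 03 and address field 00100.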

Our LARC had 12 non-interleaved core memory boxes each with 2500 words. An entire memory cycle for a box was 4μs. The machine ran on a global 4μs cycle divided into eight 500 ns slots. Each circuit in the machine would carry a new boolean value each 500 ns. Memory latency was 5 or 6 slots, I recall. The major units in the machine, except core boxes, were tasked according to the current slot’s identity. Thus the 8 slots on the memory bus were preallocated to these 8 distinct functions:

Processor 1 data access,
Processor 2 data access,
IO processor instruction fetch or data access,
Processor 1 instruction fetch,
Processor 2 instruction fetch,
DMA 1 access,
DMA 2 access,
DMA 3 access.

I am unsure that there were indeed three DMA units. I think that there was only one processor on each of the LARCs that were delivered. Every LARC had a specialized IO processor that tended to the real time aspects of IO.
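The fixed slot allocation amounts to a lookup table indexed by time. The sketch below is illustrative only: it assumes that the order of the list above is the slot order and that slot 0 begins at time zero, neither of which I can vouch for, and the names SLOT_FUNCTIONS and slot_owner are hypothetical.

```python
SLOT_NS = 500  # each slot lasts 500 ns

# Assumption: the list order above is taken as the slot order.
SLOT_FUNCTIONS = (
    "Processor 1 data access",
    "Processor 2 data access",
    "IO processor instruction fetch or data access",
    "Processor 1 instruction fetch",
    "Processor 2 instruction fetch",
    "DMA 1 access",
    "DMA 2 access",
    "DMA 3 access",
)

CYCLE_NS = SLOT_NS * len(SLOT_FUNCTIONS)  # 8 slots of 500 ns = 4000 ns = 4 μs


def slot_owner(time_ns: int) -> str:
    """Return the function that owns the memory bus at the given time."""
    return SLOT_FUNCTIONS[(time_ns % CYCLE_NS) // SLOT_NS]
```

The point of the design is that no arbitration happens at run time: ownership of the bus depends only on the time, never on what the units happen to be requesting.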

The 26 general purpose registers served as index registers, fixed registers and floating registers. They were hand wound cores with a one μs cycle time and one μs latency. Memory addresses 99900 thru 99925 referred to these registers. Use of these addresses incurred a 4μs penalty.

Unlike current RISC machines, there were few adders. The main adder, allocated according to the 8 slot schedule, would calculate an effective address on one slot, an instruction address on another, a fixed point add result or the mantissa of a floating add on yet another. The adder was not, however, shared between processors, as was the case with the PPUs in the 6600.

The multiplier used a decimal version of carry-save add. The designer told me that the idea was already ancient. Wikipedia says that von Neumann invented this idea. Sequentially dependent floating adds would proceed at 4μs each. At most one instruction could be issued each 4μs. If an instruction modified an index register and the next instruction used that register in its effective address calculation then there was a 4μs penalty.

Given the above, we can reconstruct a rough description of the degree of LARC pipelining. Here is the schedule of events for one floating add instruction in a stream of sequentially dependent floating operations. The number at the left is the relative slot number:

0: Calculate the address of the instruction.
1: Summon the instruction.
7: Decode instruction and summon index register value.
9: Calculate the effective address.
10: Summon the core operand.
14: Summon register operand.
16: Initiate floating add sequence.
23: Send floating sum to register.

The machine did register forwarding for sequentially dependent floating and fixed operations. Such a stream issued an instruction every 4μs. (Floating multiply took 8μs and floating divide took 28μs.) You might describe the degree of pipelining as three. The 7090, a contemporary machine from IBM, was slightly more than one, as I recall. The Stretch was rather more than three.
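The figure of three can be checked against the schedule for the floating add. The sketch below reads "degree" as the number of instructions in flight at once, which is one interpretation rather than a formal definition; the function names are hypothetical.

```python
import math

SLOTS_PER_CYCLE = 8  # one instruction issued per 4 μs cycle of eight 500 ns slots

# Relative slot numbers from the floating add schedule.
SCHEDULE = {
    0: "Calculate the address of the instruction",
    1: "Summon the instruction",
    7: "Decode instruction and summon index register value",
    9: "Calculate the effective address",
    10: "Summon the core operand",
    14: "Summon register operand",
    16: "Initiate floating add sequence",
    23: "Send floating sum to register",
}


def instructions_in_flight(schedule, issue_interval=SLOTS_PER_CYCLE):
    """Slots spanned by one instruction, divided by the issue interval."""
    span = max(schedule) - min(schedule) + 1  # slots 0 through 23 make 24 slots
    return math.ceil(span / issue_interval)


def events_at(slot, n_instructions, issue_interval=SLOTS_PER_CYCLE):
    """List (instruction number, event) pairs active in an absolute slot."""
    return [(k, SCHEDULE[slot - k * issue_interval])
            for k in range(n_instructions)
            if slot - k * issue_interval in SCHEDULE]
```

One instruction spans 24 slots and a new one starts every 8, so instructions_in_flight(SCHEDULE) gives 3. At absolute slot 16, for instance, the first instruction is initiating its add while the third is having its address calculated.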

Comparing Stretch and LARC strategies leads me to the following points regarding allocation of hardware units to logical tasks:

The LARC benefited from clever engineers who decided at hardware design time how to allocate hardware. They were in a position to modify the design of other units to achieve clever melding of the schedule.
The Stretch benefited from late binding of allocation of hardware to task.

Both the LARC and the Stretch could access general registers using memory addresses. In both cases this was a bad idea. While it seemed an easy generalization that made code easier and shorter, it seldom resulted in a faster program, for the implied data flow was at odds with the rest of the machine design. There were cases where it was faster for a longer program to move the data via core. The next generation of machines included instructions that explicitly addressed more than one register for operands and that was indeed strategic.

The memory addresses for the LARC registers were of the form 999XX. The assembler accepted symbolic values for register designations in instructions, both 2 digits and 5 digits. The assembler required that the three discarded digits in preparing a 2 digit field be 999. This had the beneficial result of producing an error when an address of a core word was accidentally used to name a register. I have needed and missed that warning on subsequent assemblers. One such missing warning caused more than 10 crashes in a production system.
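The assembler's check is simple enough to sketch. This is not the LARC assembler's actual logic, only an illustration of the rule described above; the function name register_field is hypothetical, and representing a designation as a string of digits (so that a 2 digit value and a 5 digit value can be told apart) is an assumption.

```python
def register_field(designation: str) -> str:
    """Reduce a register designation to the 2-digit instruction field.

    A 2-digit value is used as is. A 5-digit value must have the form
    999XX, and the three leading digits are discarded.
    """
    if not (designation.isascii() and designation.isdigit()):
        raise ValueError("register designation must be decimal digits")
    if len(designation) == 2:
        return designation
    if len(designation) == 5:
        if designation[:3] != "999":
            # A core address was used where a register was expected.
            raise ValueError(f"{designation} does not name a register")
        return designation[3:]
    raise ValueError("register designation must be 2 or 5 digits")
```

Here register_field("99925") yields "25", while register_field("01234") raises an error instead of silently truncating the address to register 34.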

The timing rules were easy to understand for the LARC and were indeed well understood two years before the machine was shipped. The timing rules for the Stretch were hard to understand and few understood them well. Such understanding as there was came about only a year after the machine was shipped. While the Stretch missed its speed goals (and the LARC didn’t), it was less late and was also a rather faster machine.

Both machines had random access mechanical memory (moving head disks and drums) and their performance characteristics were another saga.

Chuck Leith tells of the first application on the LARC.

An anecdote that corroborates that the LARC had the first “general registers” is the origin of the number 26 in the 26 general registers of Livermore’s machine. Rem-Rand asked how many registers we needed. We thought that 16 index registers felt right (the 704 architecture suggested a power of 2 for index register count) and about 10 accumulators. Soon they said that the index registers and accumulators would be unified. 16+10 = 26.

I was off at IBM during initial negotiations between LLNL (then LRL) and CDC. The register file idea exploded throughout computerdom. Thereafter only small machines had a single accumulator. Perhaps Stretch was the last big machine with just one.