I have had an itch for some time to write about ideas for ordering calculations inside a computer. The ideas that I want to write about are very varied and so is the audience to whom I want to write. So I write first and organize later. A related issue, on which I have little to say that is new, is how the concepts of programming languages bear on the order in which computers take steps to obey the program. A year later ... I am still not near to finding an organization for these thoughts, and yet they seem ever more important with the advent of ubiquitous multi-processors.

Here is a page that says much of what I want to say.

Starting at the bottom we first review how just about all current computers work at the hardware level. So little has changed here in the last 50 years.

Modern computers fetch instructions from successive locations in memory and obey them. In each instruction there is an op code (operation code) that selects some specific action to perform. The selection is from some fixed set of operations built into the hardware and that the hardware can carry out without further reference to other instructions. In just about all current computers these operations may call for modifying a register or memory location by inserting a new value resulting from some very simple calculation that depends on the content of just a few other specified registers or memory locations. Registers are few and fast; memory locations are vast and as fast as the registers of machines just a couple of decades ago. Here is some extensive information on these machines.
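To make that fetch-and-obey cycle concrete, here is a small sketch in C of an imaginary machine, not any particular real one: an accumulator, a memory of words, and a handful of op codes. The names and the op code set are invented for the illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* An imaginary machine: an accumulator, a word-addressed data memory,
       and instructions holding an op code and an address.                 */
    enum { OP_HALT, OP_LOAD, OP_ADD, OP_STORE, OP_BRANCH_NEG };

    typedef struct { uint8_t op; uint16_t addr; } Instr;

    int32_t data[256];                 /* memory locations                  */
    Instr   program[256];              /* instructions live in memory too   */

    void run(void) {
        int32_t  ac = 0;               /* the accumulator (a register)      */
        uint16_t pc = 0;               /* location of the next instruction  */
        for (;;) {
            Instr i = program[pc++];   /* fetch from successive locations   */
            switch (i.op) {            /* the op code selects the action    */
            case OP_LOAD:  ac = data[i.addr];               break;
            case OP_ADD:   ac += data[i.addr];              break;
            case OP_STORE: data[i.addr] = ac;               break;
            case OP_BRANCH_NEG:        /* conditional branch: change pc     */
                if (ac < 0) pc = i.addr;                    break;
            default:       return;     /* OP_HALT                           */
            }
        }
    }

    int main(void) {
        data[0] = 2; data[1] = 3;      /* compute 2 + 3 and store it at 2   */
        Instr prog[] = { {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0} };
        for (int i = 0; i < 4; i++) program[i] = prog[i];
        run();
        printf("%d\n", data[2]);       /* prints 5 */
        return 0;
    }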

Instructions are normally fetched from sequential locations, but among the op codes on any machine are conditional branches that change the location from which to fetch instructions, contingent on some register value. Alas, the three words branch, jump and transfer have each been used historically for the same function, and it seems awkward to me to describe a particular machine in terms other than those originally used for that machine.

The IBM 701 (1954) had just two registers: AC and MQ. Machine code to call a subroutine, at the call site, would put its own address in the AC and unconditionally branch to the routine. A prelude in the routine would modify the call-site address held in the AC into an unconditional branch back to the code following the site. That branch would then be stored at the end of the routine. This scheme is neither recursive nor reentrant and the defined semantics of Fortran II subroutines reflected this. Until the middle 60’s it was common for machines to include call instructions that stored the return address at X and branched to the instruction following X. This supported Fortran semantics (of the day) but not Algol, which practically required a stack and recursion.

The IBM 704 added index registers and a “TSX” instruction that would branch to an address but leave the address of the TSX in an index register. A single unmodified branch could use that index register value to return. An index register was big enough to hold an address and was used mainly to contribute to effective addresses in those instructions that named that register in a small field. The effective address was the sum of the index register value and an address included in the instruction. The 704 had three index registers. Subsequent machines had from 1 to 16 index registers. Index registers removed the main reasons for code that modified itself.

See this note on recent perspectives on subroutines.

Coroutines were known but little used. The big Livermore mesh calculations would pass over the data in a repetitive and logically simple fashion. The subroutine, the loop and the conditional block were just about the only constructs to order the calculation. Subroutines were nested, of course, but almost never recursive. When we got a CalComp plotter at Livermore in about 1958 I wanted to plot a Koch curve. The curve being recursive in nature, I decided that the easiest way was to duplicate a short Fortran subroutine 7 times and adjust each routine to call the next. The duplication of the Fortran source was via a card reproducing punch.

I/O

In addition a computer has I/O equipment via which the computer can read and write information in forms accessible outside the computer.

Before about 1960, to do I/O the program would issue a read or write instruction that specified the I/O device and then issue copy instructions as each word was ready to be transferred. If there were too many instructions before a copy instruction the I/O function would fail. Among the big commercial binary computers, the IBM 701 and 704 machines, and the ERA 1101 and 1103 computers were of this design. Here are some notes on early I/O programming.

In about 1953 the Univac tape units included a 60(?) word buffer and would anticipate a read so that when the program came to the point of reading, the buffer would probably already hold a block of data that would be quickly transferred to main memory (mercury delay lines). The buffer could also accept a block to be written to tape, thus freeing the program to proceed while the actual writing proceeded concurrently.

In the late 50’s machines gained the ability to perform input output without requiring the constant attention of the CPU. IBM called this extra hardware a channel and the 709’s had channels. Large blocks of data could be moved between core and the I/O device concurrently with computation. The names have since changed more than the idea.

With the CPU no longer required each microsecond during I/O, techniques were developed to profitably devote the CPU to the main calculation while I/O proceeded apace. On the 709 there were conditional test instructions which would quickly reveal whether some particular I/O operation had finished. The application program would sprinkle these about and respond strategically. This required that the application explicitly arrange for overlapped I/O and calculation. This added somewhat to the cost of application design but was accomplished for production jobs at many computer sites.

The IBM 7090 was a faithful transistorized implementation of the 709. On the 7090 (and some retrofitted 709’s) it was possible for the program to arrange for an interrupt to happen at the completion of some I/O operation. It was 1960 when Elaine Boehm explained to me, as follows, IBM’s rationale for adding interrupts to the 709 architecture. I record these ideas from the 1960 perspective—in the jargon of the day, as if they were new. They were new then to most programmers. Certainly others had begun to design for ‘multiprogramming’ and there were scattered earlier successes.

When the I/O finished, the CPU would be executing code unassociated with the I/O, whose purpose was unaffected by the I/O. Nonetheless it was strategic to interrupt that code and commence obeying an interrupt routine designed to respond to the completed I/O. This action was the interrupt and had to be carried out so that the purpose of the interrupted code could be served by continuing its operation later without reference to the design of that code. This was achieved by a combination of hardware features and the design of the interrupt routine. The address of the instruction in the interrupted code that would have been executed next was captured by the hardware and delivered to the interrupt routine, so that that routine could save all of the machine state in preparation for later returning to the interrupted program.

Suppose that there is a task that involves many I/O operations interspersed with trivial operations that only the CPU can do. Without interrupts there are only two obvious ways to do this task:

  1. The CPU repeatedly:
    • starts an I/O operation,
    • tests repeatedly for the end of the I/O operation,
    • performs its part.
  2. Scatter calls to a poll routine throughout the code of some application. The routine performs its next acts if the I/O is done and returns to the application. There was a tradeoff between polling too often and not often enough.
The drawbacks of these are obvious. The second scheme requires expensive coordination between programming teams as well as additional programming. Computer operations must be aware of which applications can be teamed with which I/O intensive jobs.
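To be concrete, here is a sketch of both schemes in C. The device is pretend, simulated by a counter so that the sketch can run at all; the names start_read, io_done and process_block are invented and stand for whatever the real device and task would require.

    #include <stdio.h>

    /* A pretend device: each read "finishes" after it has been tested a
       few times.  Real code would talk to actual hardware.               */
    static int busy_ticks = 0;
    static void start_read(void) { busy_ticks = 3; }
    static int  io_done(void) {
        if (busy_ticks > 0) { busy_ticks--; return 0; }
        return 1;
    }
    static void process_block(int n) { printf("processed block %d\n", n); }

    /* Scheme 1: the CPU busy-waits; nothing else gets done meanwhile.    */
    static void copy_all_blocks(int nblocks) {
        for (int n = 0; n < nblocks; n++) {
            start_read();
            while (!io_done())         /* test repeatedly for the end of the I/O */
                ;                      /* the CPU spins, wasted                  */
            process_block(n);
        }
    }

    /* Scheme 2: a poll routine whose calls are scattered through some
       application.  Too few calls and the device sits idle; too many and
       the application pays for tests that usually find nothing to do.    */
    static int next_block = 0;
    static void poll(void) {
        if (io_done()) {
            process_block(next_block++);
            start_read();              /* keep the device busy */
        }
    }

    int main(void) {
        copy_all_blocks(3);            /* scheme 1 */
        start_read();
        for (int work = 0; work < 20; work++) {
            /* ... some unrelated application work would go here ... */
            poll();                    /* scheme 2, sprinkled about */
        }
        return 0;
    }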

With hardware interrupts added to the system design, the instruction stream is immediately diverted, when the I/O completes, to the task that only the CPU can do and that had to wait for that completion. The point of diversion, within the application, did not need a programmed test for I/O completion, thus improving both software modularity and hardware efficiency. Here is a note on the history of interrupts.
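Here is the interrupt-driven shape of the same task, again a sketch in C with an invented, simulated device; register_interrupt_handler and device_tick merely stand in for what the hardware and channel did for real.

    #include <stdio.h>

    /* A pretend device again.  device_tick() plays the role of the
       hardware: when the pending operation completes it calls whatever
       handler was registered.                                            */
    static void (*handler)(void);
    static int busy_ticks;             /* 0: no operation in progress */

    static void register_interrupt_handler(void (*h)(void)) { handler = h; }
    static void start_read(void) { busy_ticks = 3; }
    static void device_tick(void) {            /* called as time passes */
        if (busy_ticks > 0 && --busy_ticks == 0 && handler) handler();
    }

    /* The interrupt routine: the part of the task only the CPU can do.
       The interrupted code needs no knowledge of it.                     */
    static int blocks_remaining;
    static void on_io_complete(void) {
        printf("processed a block\n");
        if (--blocks_remaining > 0)
            start_read();              /* immediately start the next one */
    }

    int main(void) {
        blocks_remaining = 3;
        register_interrupt_handler(on_io_complete);
        start_read();
        for (int work = 0; work < 20; work++) {
            /* the application simply computes; no polls appear in its code */
            device_tick();             /* stands in for real time passing   */
        }
        return 0;
    }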

The 7090 had no semblance of protection and this scheme suggested concurrent agenda within the machine: one to do with a CPU intensive task and another to do with the I/O intensive work. “Multiprogramming”, the word, had not yet been invented; it was still a vague goal.

IBM introduced the 360 series in the early 60’s. Those machines had sufficient protection for an operating system to provide an environment for applications so that the concurrent agenda of more than one application could be accommodated in the manner suggested above. Bugs in one application would impact neither another application nor the operating system. IBM delivered operating systems to accomplish this just a few years later.

Demand Paging

In England the Ferranti Atlas was pioneering in another direction. Already in the very early 60’s the Atlas had demand paging, which meant that the application logic was oblivious even to the I/O that it was causing in the form of page faults. The result was the illusion of a virtual memory much larger than the real memory. This further modularized and simplified application logic.
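Demand paging is with us still. Here is a small C program, a sketch only, whose inner loop is just as oblivious: it sums the bytes of a file mapped with POSIX mmap, and whatever I/O is needed happens behind its back as page faults.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sum the bytes of a file mapped into the address space.  The loop is
       oblivious to I/O; touching an unmapped page faults, and the system
       reads that page in before the load completes.                        */
    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) < 0) return 1;
        const unsigned char *p =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;
        unsigned long long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];              /* may cause a page fault; we never know */
        printf("%llu\n", sum);
        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }

The file may be far larger than real memory; the program neither knows nor cares.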

The Atlas supported the idea that one program could proceed while another was awaiting the I/O necessary to serve a page fault. Lack of sufficient real memory limited this technique.

This line of development coincided with the early rumblings of Timesharing.


Digging even deeper into hardware details reveals interesting ideas that have emerged in recent computers. The dataflow paradigm calls for a computation to proceed when its input is available. This idea can be applied in a variety of ways. Some processors issue instructions “out of order”—an instruction is examined early by the hardware but held in abeyance until its inputs are ready even as subsequent instructions are examined and perhaps initiated. The real instructions may thus be obeyed in other than program order. Sometimes the programmer or compiler must cooperate to make this happen. A close variation on this is speculative execution where hardware facilities, otherwise idle, are set to obeying instructions that a previous conditional branch may yet make null and void. Various techniques are used to make such failed computations harmless.

Just recently I heard the comment that the stack was the heart of computer and software design. I responded that I had been programming several years before the stack was invented. That event in part instigated this note. The programs of pre von Neumann machines were not in central memory. I do not know whether any of these machines were able to operate on arrays in memory. The memories then were so small that this might have been infeasible. Placing the program in central memory allowed the program to operate on itself and this allowed the program to operate on an array by modifying those instructions that accessed the array. To sum an array of numbers one would write a loop and in the loop would be an instruction to modify the add command that accessed an element of the array. The address of the array element would be in the add instruction and the instruction would fetch the array element from the array and add the value to the accumulator. Self modifying code is now considered exotic. It was once at the heart of scientific programming.
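Here is the shape of that loop in C, with the add instruction simulated as a word of data, since writing over real instructions is no longer welcome; the op code and field names are invented for the illustration.

    #include <stdio.h>

    /* Simulate the old pattern: the "add instruction" is a word holding an
       op code and an address; the loop bumps that address on each pass,
       just as the real code would have altered the instruction in memory.  */
    struct instr { int op; int addr; };
    enum { OP_ADD = 1 };

    int store[16];                         /* the memory holding the array  */

    int main(void) {
        for (int i = 0; i < 10; i++) store[i] = i + 1;   /* array of 10 numbers */

        struct instr add = { OP_ADD, 0 };  /* "ADD 0": address of element 0 */
        int ac = 0;                        /* the accumulator               */
        for (int count = 0; count < 10; count++) {
            if (add.op == OP_ADD)
                ac += store[add.addr];     /* obey the (simulated) instruction  */
            add.addr += 1;                 /* the loop modifies the instruction */
        }
        printf("sum = %d\n", ac);          /* prints 55 */
        return 0;
    }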

The Univac I computer had 1000 words of memory which could each hold one number or two instructions. There were about 10 magnetic tape units. Livermore used the machine to do two dimensional hydrodynamics programs with 10 or so numbers per zone, and many zones in both directions. Not even a row would fit in memory. The machine had no index registers and typically one zone with perhaps four to eight neighbors would be in memory at once. After the calculation for a zone for a time step the new results would be written to tape and values for the next zone would be read from tape. In this scheme there was no need to modify the addresses in instructions to access data from arrays. The tapes were buffered so there was overlap between tape I/O and computation.

Compilers were written for the Univac which organized the work to fit with the abilities of magnetic tape storage. I think perhaps those languages did not support arrays. Livermore’s Univac programs were all written in a crude assembler language.


Abusing the Stack

Prolog is a language and interpreter that orders its calculations in an unconventional way that I will not try to describe here. From these implementation ideas came languages such as Joule in which you might imagine a program reading:
for (int x = -10; x <= 10; ++x) plot(x, sqrt(25 - x*x));
that plots a crude circle, including the negative roots.

The familiar loop construct from C must first be replaced by one that does not so manifestly call out how the set of values for x is to be produced. Thus in Algol-68:
for x from −10 to 10 do plot(x, sqrt(25 − x*x)) od;

Now the imperative seems to be that for 21 values of x, invoke the plot routine. What happens in the sqrt routine when its argument is negative? Negative numbers have no real square roots and so none are returned. Indeed the sqrt routine does not return in such cases. After it notices that there are no roots it goes to code that looks for necessary things to do. When x = −5 the argument is 0, which has just one root, which is returned. This causes a point on the circle to be plotted. Between −5 and 5 there are always two roots, both of which are returned, in some irrelevant order. This happens because the sqrt routine notes that there are two responses and stashes the continuation, and the other root, in the list of “necessary things to do”.

Conventional semantics would say that the program finishes when it is discovered that there are no more values for x for which the plot routine needs to be invoked and there are no more necessary things to do. The new semantics says merely that the program is finished when there are no more invocations of plot to be performed. If the underlying hardware is a single processor then this would seem to suggest code to maintain a varying set of calculations to be continued. The program finishes when there are no further necessary things to do.

This account of the meaning of the program ignores all questions of the order in which the steps of the program are executed. A more detailed account in the jargon of continuations is that when the argument of sqrt is 25 the routine gets a continuation, which incidentally leads to the plot routine being invoked. There being two roots, this continuation is twice invoked and two points are plotted. Dean Tribble invented Joule and assures me that the above pattern was not what Joule was invented for.
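C has no continuations, so the following sketch, with invented names, captures only the flavor: a sqrt-like routine stashes zero, one or two pairs of root and continuation on an explicit list of necessary things to do, and a driver runs that list until it is empty.

    #include <math.h>
    #include <stdio.h>

    /* A "necessary thing to do": call a consumer with a pair of values.   */
    typedef void (*consumer)(double x, double y);
    struct pending { consumer k; double x, y; };

    static struct pending todo[64];
    static int ntodo;

    static void stash(consumer k, double x, double y) {
        todo[ntodo++] = (struct pending){ k, x, y };
    }

    /* multi_sqrt "returns" no times, once, or twice, by stashing each root
       together with its continuation k on the worklist.                   */
    static void multi_sqrt(double a, double x, consumer k) {
        if (a < 0) return;                 /* no real roots: do not return   */
        stash(k, x, sqrt(a));              /* the positive root              */
        if (a > 0) stash(k, x, -sqrt(a));  /* the other root, order irrelevant */
    }

    static void plot(double x, double y) { printf("plot(%g, %g)\n", x, y); }

    int main(void) {
        for (int x = -10; x <= 10; ++x)
            multi_sqrt(25.0 - x * x, x, plot);
        while (ntodo > 0) {                /* finished when nothing remains  */
            struct pending p = todo[--ntodo];
            p.k(p.x, p.y);
        }
        return 0;
    }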

This is a more careful study of the above in Scheme.

Here is a plan to organize many orthogonal software functions with nothing resembling a stack.

Here we learn how to express Prolog programs in Scheme syntax using multi pop continuations.


More familiar than the above are the habits of a multiprogramming OS. Today almost everyone is familiar with non-preemptive multiprogramming where one application can begin before another finishes. An application must occasionally voluntarily surrender the machine in case some other application has urgent work to do. With preemptive multiprogramming, on the other hand, the operating system uses the clock to regain control frequently and possibly switch between applications on some basis such as “fairness”. In either case, each application has exclusive access to its private stack. Of course an application may present more than one “program” or “task” to the multiprogramming facilities. This is indeed the obvious way for compute intensive applications to employ multiple processors.
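With POSIX threads, for instance, an application can present two such tasks, each with its own private stack; a minimal sketch:

    #include <pthread.h>
    #include <stdio.h>

    /* An application presenting two "tasks" to the multiprogramming
       facilities: each thread gets its own private stack, and on a
       multiprocessor the two may truly run at once.                   */
    static void *task(void *arg) {
        const char *name = arg;
        long sum = 0;
        for (long i = 0; i < 1000000; i++) sum += i;   /* compute intensive part */
        printf("%s done, sum = %ld\n", name, sum);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, task, "task A");
        pthread_create(&b, NULL, task, "task B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }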

The Cactus stack combines these ideas as reported here. Implicit but not evident from that note is that tasks, or branches of the stack, could share data on the common shared portion of the stack. This was a natural outcome of the mechanisms used to support lexical scoping.

See Continuously Evaluated Functions.
Transaction oriented in network context
Stream Oriented in networks


See this note on use of exclusive locks. See notes on the Event Loop pattern. See this for a list of negative points people have made about threading. I disagree with the majority, yet it is a collection of important pointers. The ones I have followed are worthwhile. Google’s MapReduce tackles a complex problem in distribution of processing.

Here is Gene Amdahl’s story.

Sour grapes on parallel processing.