Keykos, Persistence and the Single Level Store

Multics pioneered the single level store, I think. Only the kernel was aware of the respective roles of core (the RAM of the day) and disk; core served as a cache for the disk. Access to files through virtual memory was the only file access method provided by the kernel. I don’t recall whether there were steps that an application could take to ensure that data was safely on the disk. The operator could cause an orderly shutdown so that the state on disk would include the most recent state of every page. When Multics was shut down or crashed, the state of the running processes was lost. Each time Multics booted, it restarted processes much as Unix does today.

Persistence

By contrast, Keykos adopts the single level store as in Multics, but preserves the processes as well. Checkpoints are taken every few minutes, and the system can restart from the most recent one in the time necessary to read it in. The checkpoint includes process states. The architectural strategy that makes this easy is to keep all process state in pages and nodes, which have official status on the disk. All state lives in space bought by some user.
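
The idea can be sketched with a toy model. This is not Keykos code: the names ToySystem, step, "pc" and "page" are invented for illustration, and a real checkpoint writes to disk rather than copying dictionaries. The point it tries to show is only that, once every bit of process state lives in pages and nodes, saving the pages and nodes saves the processes too.

```python
import copy


class ToySystem:
    """Toy model: all state, process state included, lives in pages or nodes."""

    def __init__(self):
        self.pages = {}   # page id -> bytearray of data
        self.nodes = {}   # node id -> dict (stands in for process state)
        self.disk = None  # last committed checkpoint, or None

    def checkpoint(self):
        # Snapshot pages and nodes together; nothing else needs saving.
        self.disk = copy.deepcopy((self.pages, self.nodes))

    def crash_and_restart(self):
        # Memory is lost; the system resumes from the last checkpoint.
        if self.disk is None:
            raise RuntimeError("no checkpoint to restart from")
        self.pages, self.nodes = copy.deepcopy(self.disk)


def step(system, pid):
    """Advance a hypothetical process: bump its counter, record it in its page."""
    node = system.nodes[pid]
    node["pc"] += 1
    system.pages[node["page"]][0] = node["pc"]


system = ToySystem()
system.pages["p0"] = bytearray(1)
system.nodes["proc"] = {"pc": 0, "page": "p0"}

for _ in range(3):
    step(system, "proc")
system.checkpoint()

step(system, "proc")        # work done after the checkpoint
system.crash_and_restart()  # that work is lost, the process is not

print(system.nodes["proc"]["pc"])  # 3
```

The process is never "restarted" in the Unix sense here; it simply continues from where the checkpoint caught it.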

In conventional systems it is commonplace to restart the system when some apparently transient event has corrupted its state. Whether such events are indeed software bugs or in fact transient hardware errors, restarting the system recovers from the great majority of them. With a persistent system there is a danger that the checkpoint itself is corrupted, in which case a restart would merely restore the corruption. The kernel maintains many invariants, in the computer science sense of the word. These are documented in the kernel logic manual and also in kernel code called “CHECK”, which runs just before each checkpoint is taken. This check usually provides an opportunity to recover an uncorrupted state in those cases where the transient error occurred while in privileged mode. Certain very critical user mode programs, such as the space bank, have run for several years without suffering a fault. The bank code is not especially large and seems to be bug free. The kernel has run for about a year between crashes. The IBM/370 series seemed to be somewhat more reliable yet.
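
A minimal sketch of checking invariants before committing a checkpoint follows. The two invariants, the state layout and the function names are all hypothetical; the real CHECK code and the kernel logic manual are not reproduced here. The sketch also assumes that when a check fails, the previous checkpoint is simply left in place so that it remains available for a restart.

```python
import copy


def check(state):
    """Return a list of violated invariants; an empty list means none were found."""
    problems = []
    # Hypothetical invariant 1: every page a node names must exist.
    for node_id, node in state["nodes"].items():
        for page_id in node["pages"]:
            if page_id not in state["pages"]:
                problems.append(f"node {node_id} names missing page {page_id}")
    # Hypothetical invariant 2: no more pages in use than were bought.
    if len(state["pages"]) > state["pages_bought"]:
        problems.append("more pages in use than were bought")
    return problems


def take_checkpoint(state, disk):
    """Commit state to disk only if check() finds nothing wrong."""
    problems = check(state)
    if not problems:
        disk["checkpoint"] = copy.deepcopy(state)
    # On failure the old checkpoint is untouched and can still be restarted from.
    return problems


disk = {}
good = {"pages": {"p0": b"x"}, "nodes": {"n0": {"pages": ["p0"]}}, "pages_bought": 1}
print(take_checkpoint(good, disk))  # []

bad = copy.deepcopy(good)
bad["nodes"]["n0"]["pages"].append("p9")  # simulate a transient corruption
print(take_checkpoint(bad, disk))
```

Running the check immediately before the commit matters: a corruption caught at that moment never reaches the disk, so the last good checkpoint survives.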

See this about more ramifications of persistence and this about the time warp at the interface to the real world. Other notes on persistence:
Story of 10^17 operation calculation