Impact of Persistent Memory

I am often asked about the consequences of some system or application component crashing or looping in a system with persistent memory such as Keykos. If you bring up the system again, the component is still crashed or looping. Here is our experience with Keykos, which ran for a number of years.

In most computer systems today it is common to “reset the platform” when a program goes haywire. On most PCs this still means rebooting the entire machine. On Unix and other larger systems some subcomponents can be “reset” while others continue. In Keykos we were worried about this but it did not turn out to be a significant problem.

Many objects only require short lived state. To compile a source file one would create an instance of a compiler. It was nice but not critical that upon restart compilations would proceed from where they left off. If the compiler failed one could put a capability to the broken compiler in mail and send it off to the compiler expert. In the meantime you could still create new compiler instances. This is much like familiar practice except that the compiler expert got the program in situ instead of some rigid corpse. (2014) The recent Heartbleed bug in SSL is a good example. In Keykos the lifetimes of the various objects associated with a network session never outlasted the session.

Long Lived State is where the benefits and problems of persistent memory arise. In conventional systems, state that must survive system crashes or planned system shutdowns must be transcribed to the file system and recovered from files upon restart. Upon unplanned shutdown the recovered state may be out of sync with the state of other objects. With persistent memory all of this is unnecessary. Here are a few of the Keykos objects with long lived state and our experience with them:

Space Bank: This is the service from which one asks for space. Such space generally comes from a shared pool. If the space bank goofs then the system can become very sick and it may die. The space bank is not an especially simple program for it was required to implement several space policies. The space bank was heavily executed and debugged early. It seems to be several thousand lines of bug free code.
Network Adapter: Most interactive terminals were connected thru Tymnet. There was a single shared physical interface to that network. If the program that multiplexed that interface crashed, it was so much like the network crashing that precautions against the latter solved the former as well.
The Factory: The factory served as the source of new instances and, by its definition, kept no state between invocations. Its logic ensured that each new instance began in the same state.
Balanced Tree: This served as a mutable permanently sorted map from strings to capabilities. It was the embodiment of directories, a bit in the flavor of Unix directories.
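
To make the last item concrete, here is a minimal sketch of a mutable, always-sorted map from strings to capabilities. It is only an illustration and not the Keykos code: the class and method names are hypothetical, a capability is modeled as any opaque Python object, and a sorted list with binary search stands in for a real balanced tree.

```python
import bisect


class SortedDirectory:
    """Hypothetical sketch: a mutable map from names to capabilities,
    with the names kept in sorted order at all times."""

    def __init__(self):
        self._names = []  # names, always sorted
        self._caps = {}   # name -> capability (an opaque object here)

    def put(self, name, cap):
        # Insert the name in sorted position only if it is new.
        if name not in self._caps:
            bisect.insort(self._names, name)
        self._caps[name] = cap

    def get(self, name):
        # Return the capability stored under name, or None.
        return self._caps.get(name)

    def delete(self, name):
        if name in self._caps:
            del self._caps[name]
            del self._names[bisect.bisect_left(self._names, name)]

    def next_after(self, name):
        # First name strictly greater than `name`, or None at the end.
        i = bisect.bisect_right(self._names, name)
        return self._names[i] if i < len(self._names) else None
```

Stepping through the names with next_after, starting from the empty string, lists the entries in sorted order, which is roughly what one wants from a directory.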

Each of these objects was heavily used upon introduction and, except for the Tymnet adapter, bugs were stamped out soon after introduction. The Tymnet adapter had to contend with a poorly specified interface. One reason for the early finding of bugs is that there was no alternative. It was unusual for a bug to hit more than once before being fixed. I have not worked in a shop devoted to developing Unix, but in shops that merely use Unix, the style is not to be particularly surprised if the kernel or some daemon dies. The cure is to restart it.

Bugs in the privileged kernel are a special case. Since the system state has a well defined representation on the disk it is easy to replace the kernel without perceptibly bringing the system down. Of course if the new kernel has bugs the system will most likely crash and revert to an older state and an older reliable kernel. Checkpoints were preceded by an extensive sanity check by simple privileged code. It was almost unheard of to take a checkpoint which was corrupted by action of a buggy kernel.
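
The idea of checking before committing can be sketched as follows. This is an illustration under stated assumptions, not the Keykos mechanism: the function names, the dictionary used for system state, and the list used as the checkpoint store are all hypothetical.

```python
import copy


def try_checkpoint(state, sanity_checks, checkpoints):
    """Append a copy of `state` to `checkpoints` only if every check passes.

    Hypothetical sketch: `sanity_checks` is a list of functions that each
    take the state and return True when it looks consistent.
    """
    for check in sanity_checks:
        if not check(state):
            # Refuse to commit; the last good checkpoint stays current.
            return False
    checkpoints.append(copy.deepcopy(state))
    return True


def restart(checkpoints):
    """Resume from the most recent checkpoint that was committed."""
    return copy.deepcopy(checkpoints[-1])
```

If a buggy kernel damages the running state, the check fails, no new checkpoint is taken, and a restart resumes from the older state.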

Another relevant observation is that we ran on IBM 370 hardware. About 20% of that hardware was devoted to checking. The CPU itself was heavily checked. RAM included ECC. We did software disk mirroring (RAID) and could thus survive total wipeout of disk drives. I don’t know how reliable current hardware platforms are. It seems probable to me that most crashes are due to software, although the evidence is poor.

An anecdote is apropos here. One kernel bug was to begin writing the two copies of a duplexed disk block at once. We realized that this was a bug when the RAM holding part of the block failed. (I hear that smoke was actually observed in the computer room.) The two channels that were writing the two copies onto different disks both ceased when they needed the data in the failed RAM. Both copies of the critical disk block were corrupted with a partial write. Fortunately we were able, in a few hours, to repair the disk state and not fall back to a many hour old tape checkpoint. This required intimate knowledge of the semantics of the failed disk block which would probably not be available in the field. Of course we fixed the kernel bug: don’t start the 2nd write until the first has finished.
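
The fix can be shown with a small simulation. This is a sketch under assumptions, not kernel code: the Disk class, the SourceFailure exception and the fail_at parameter are hypothetical, and fail_at stands in for the point at which the failed RAM is reached.

```python
class SourceFailure(Exception):
    """Simulated failure of the RAM that holds the data being written."""


class Disk:
    """Hypothetical disk holding byte strings indexed by block number."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)

    def write(self, block_no, data, fail_at=None):
        old = self.blocks[block_no]
        if fail_at is not None:
            # Only the first fail_at bytes reach the disk: a partial write.
            self.blocks[block_no] = data[:fail_at] + old[fail_at:]
            raise SourceFailure("source RAM failed during transfer")
        self.blocks[block_no] = data


def write_duplexed(block_no, data, primary, mirror, fail_at=None):
    """Write both copies, starting the second only after the first is done."""
    primary.write(block_no, data, fail_at)  # raises before the mirror is touched
    mirror.write(block_no, data, fail_at)
```

With this ordering a failure during the first write leaves the second copy holding the old, intact block, so at most one copy is ever corrupted by a single failure.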