In most computer systems today it is common to “reset the platform” when a program goes haywire. On most PCs this still means rebooting the entire machine. On Unix and other larger systems some subcomponents can be “reset” while others continue. In Keykos we worried about this, but it did not turn out to be a significant problem.
Many objects require only short-lived state. To compile a source file one would create an instance of a compiler. It was nice, but not critical, that upon restart compilations would proceed from where they left off. If the compiler failed one could put a capability to the broken compiler in mail and send it off to the compiler expert. In the meantime you could still create new compiler instances. This is much like familiar practice, except that the compiler expert got the program in situ instead of some rigid corse. (2014) The recent Heartbleed bug in SSL is a good example. In Keykos the lifetimes of the various objects associated with a network session never outlasted the session.
Long-lived state is where the benefits and problems of persistent memory arise. In conventional systems, state that must survive system crashes or planned system shutdowns must be transcribed to the file system and recovered from files upon restart. After an unplanned shutdown the recovered state may be out of sync with the state of other objects. With persistent memory all of this is unnecessary. Here are a few kinds of Keykos long-lived state and our experience with them:
Bugs in the privileged kernel are a special case. Since the system state is well defined on the disk, it is easy to replace the kernel without perceptibly bringing the system down. Of course if the new kernel has bugs the system will most likely crash and revert to an older state and an older, reliable kernel. Checkpoints are preceded by an extensive sanity check performed by simple privileged code. It was almost unheard of to take a checkpoint that was corrupted by the action of a buggy kernel.
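The discipline described above — sanity-check before committing a checkpoint, and keep an older checkpoint to revert to — can be sketched in miniature. This is not Keykos code; the function names, the JSON representation, and the trivial `sanity_check` are all illustrative assumptions standing in for the kernel's far more extensive checks.

```python
import json
import os

def sanity_check(state):
    # Stand-in for the kernel's "extensive sanity check"; here we
    # merely require a well-formed dict carrying a version field.
    return isinstance(state, dict) and "version" in state

def take_checkpoint(state, path="checkpoint.json"):
    """Hypothetical sketch: refuse to checkpoint corrupt state, keep
    the previous checkpoint around so the system can revert to it,
    and commit the new checkpoint atomically."""
    if not sanity_check(state):
        raise ValueError("refusing to checkpoint corrupted state")
    # Preserve the previous checkpoint as the fallback copy.
    if os.path.exists(path):
        os.replace(path, path + ".1")
    # Write to a temp file, then atomically rename into place, so a
    # crash mid-write never leaves a half-written "current" checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
```

A buggy kernel that corrupts its own state fails the check and takes no checkpoint, so the on-disk state it would revert to remains the last good one.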
Another relevant observation is that we ran on IBM 370 hardware. About 20% of that hardware was devoted to checking. The CPU itself was heavily checked. RAM included ECC. We did software disk mirroring (RAID) and could thus survive the total wipeout of a disk drive. I don’t know how reliable current hardware platforms are. It seems probable to me that most crashes are due to software, although the evidence is poor.
An anecdote is apropos here. One kernel bug was to begin writing the two copies of a duplexed disk block at once. We realized that this was a bug when the RAM holding part of the block failed. (I hear that smoke was actually observed in the computer room.) The two channels that were writing the two blocks onto different disks both ceased when they needed the data in the failed RAM. Both copies of the critical disk block were corrupted with a partial write. Fortunately we were able, in a few hours, to repair the disk state rather than fall back to a many-hour-old tape checkpoint. This required intimate knowledge of the semantics of the failed disk block, knowledge that would probably not be available in the field. Of course we fixed the kernel bug: don’t start the second write until the first has finished.
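The fix can be sketched as follows. This is a hypothetical illustration, not channel-program code; the file paths and the `write_duplexed` name are assumptions. The point is only the ordering: the second copy is not touched until the first write has completed and reached stable storage, so a failure mid-write can corrupt at most one copy.

```python
import os

def write_duplexed(path_a, path_b, offset, block):
    """Write one logical block to both mirror copies, sequentially.

    Sketch of the fix described above: complete (and fsync) the write
    to the first copy before starting the second, so that a failure
    during either write leaves at least one copy intact."""
    # First copy: write and force to stable storage before proceeding.
    fd = os.open(path_a, os.O_WRONLY | os.O_CREAT)
    try:
        os.pwrite(fd, block, offset)
        os.fsync(fd)  # do not proceed until this write has finished
    finally:
        os.close(fd)
    # Only now begin the second copy.
    fd = os.open(path_b, os.O_WRONLY | os.O_CREAT)
    try:
        os.pwrite(fd, block, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Had the buggy kernel's behavior been modeled here, both writes would have been issued concurrently, and a failure of the source data mid-transfer could have left partial writes on both mirrors at once.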