In most computer systems today it is common to “reset the platform” when a program goes haywire. On most PCs this still means rebooting the entire machine. On Unix and other larger systems some subcomponent can be “reset” instead. In Keykos we worried about this, but it did not turn out to be a significant problem.
Many objects require only short-lived state. To compile a source file one would create an instance of a compiler. It was nice, but not critical, that upon restart compilations would proceed from where they left off. If the compiler failed, one could put a capability to the broken compiler in mail and send it off to the compiler expert. In the meantime one could still create new compiler instances. This is much like familiar practice, except that the compiler expert got the program in situ instead of some immutable snapshot.
Long-lived state is where the benefits and problems of persistent memory arise. In conventional systems, state that must survive system crashes or planned shutdowns must be transcribed to the file system and recovered from files upon restart. After an unplanned shutdown the recovered state may be out of sync with the state of other objects. With persistent memory all of this is unnecessary. Here are a few kinds of long-lived state in Keykos and our experience with them:
Bugs in the privileged kernel are a special case. Since the system state had a well-defined representation on the disk, it was easy to replace the kernel without perceptibly bringing the system down. Of course, if the new kernel had bugs the system would most likely crash and revert to an older state and the older, reliable kernel. Checkpoints were preceded by an extensive sanity check by simple privileged code. It was almost unheard of to take a checkpoint that had been corrupted by the action of a buggy kernel.
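The checkpoint discipline described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Keykos implementation: the names `CheckpointStore`, `take_checkpoint`, `restart`, and `pages_balance`, and the page-count invariant itself, are all hypothetical. The idea it shows is that a snapshot is committed only after a simple check passes, so a restart always falls back to the last state that was known to be sane.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = Dict[str, int]


@dataclass
class CheckpointStore:
    """Toy model: commits a snapshot only if it passes a sanity check."""

    sanity_check: Callable[[State], bool]
    committed: List[State] = field(default_factory=list)

    def take_checkpoint(self, state: State) -> bool:
        # Run the simple check first; a corrupted state is never committed.
        if not self.sanity_check(state):
            return False
        self.committed.append(dict(state))  # copy, so later changes do not leak in
        return True

    def restart(self) -> Optional[State]:
        # After a crash, resume from the most recent committed snapshot.
        return dict(self.committed[-1]) if self.committed else None


def pages_balance(state: State) -> bool:
    # Hypothetical invariant: no negative counts, and every page is free or used.
    return min(state.values()) >= 0 and state["free"] + state["used"] == state["total"]


store = CheckpointStore(pages_balance)
store.take_checkpoint({"free": 6, "used": 2, "total": 8})  # sane, committed
store.take_checkpoint({"free": 6, "used": 5, "total": 8})  # corrupted, rejected
recovered = store.restart()  # the last sane snapshot
```

In this model a buggy kernel that corrupts the state between checkpoints costs only the work done since the last committed snapshot, which is the property the paragraph above relies on.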
Another relevant observation is that we ran on IBM 370 hardware. About 20% of that hardware was devoted to checking. The CPU itself was heavily checked. We did software disk mirroring (RAID) and could thus survive the total wipeout of disk drives. I don’t know how reliable current hardware platforms are. It seems probable to me that most crashes are due to software, although the evidence is poor.