Sharing Addresses between Privileged and User Code

The design of memory mapping hardware on several machine architectures seems to presume that the privileged code will reserve to itself a range of virtual addresses which can never be used by user code. I will call this the Reserved Address Plan, RAP. (I use ‘virtual address’ here in the first sense described here.)

Let me first present the best defense I can for RAP before I argue for another conventional plan. I invite other defenses of RAP.

As best I can determine the plan is that certain addresses be permanently reserved to the privileged code and that when a CPU runs in privileged mode, these reserved addresses are used to access both:

the privileged code and
that data used by the privileged code but inaccessible to any user code.

RAP seems to be the only pattern supported directly by the hardware of x86 (called IA32 by Intel) and the 32 bit SPARCs. In these machines a page table entry may make the page accessible to privileged code but not user code. Upon transition to privileged mode by traps or interrupts, the map is unchanged and the privileged pages become accessible by virtue of the CPU’s privileged mode.

Ostensibly this is convenient to the design of the privileged code because it may directly access the user space by the same addresses used by the user and, at the same time, access its own privileged data by their reserved addresses.

I argue below that this convenience is illusory for any kernel that claims to continue correctly after bugs in user mode programs. Such kernels can use RAP, but for those, the ostensible convenience is gone; and further, RAP is prone to security errors. These pitfalls are analogous to the Confused Deputy Problem.

Today, 2018 Jan 2, reports of an Intel flaw cast further shadows over RAP. Speculative execution of user code tries to read addresses marked “kernel only” by the map.

SMAP can ameliorate a significant fraction of the bugs facilitated by RAP, but its cost also decreases the perceived advantages of RAP.

Secure Design with RAP

Frequently the privileged code must read or write data in the user’s space. Usually the user’s program provides the address of this data either in registers at the time of system calls, or in other data structures in user memory. To access this user data, the privileged code may use normal instructions which generate virtual addresses as provided by the user program.

Imagine, however, that an erroneous user mode program provides an address that is reserved according to the RAP. Without an extra programmed check by the privileged code this will result in erroneous access to privileged data. If the nature of the system call is to copy data from user space to user space then the result will be to copy between user space and privileged space where addresses in the privileged space are determined by the user code. To prevent this, every reference by privileged code to user memory must be accompanied by a programmed check to ensure that the reference is not to a reserved address. A loop to follow a chain thru user space must perform the check on each iteration. Unix does this for some cases that I have checked, but not others. Unchecked accesses are likely to be exploitable as serious security flaws. They are not found by testing correct user code. This is why such systems are prone to security flaws.
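The programmed check described above can be sketched in C. This is only an illustration under stated assumptions: the boundary USER_LIMIT and the name user_range_ok are hypothetical, not taken from any particular kernel. The sketch assumes user space is the range below a single fixed boundary and everything at or above it is reserved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: user space is [0, USER_LIMIT); every address at or
 * above USER_LIMIT is reserved to privileged code. */
#define USER_LIMIT ((uintptr_t)0xC0000000u)

/* True only if the whole range [addr, addr + len) lies in user space.
 * The comparison is made against the room remaining below the limit, so
 * that addr + len is never computed and cannot wrap around. */
static bool user_range_ok(uintptr_t addr, size_t len)
{
    if (addr >= USER_LIMIT)
        return false;
    return len <= USER_LIMIT - addr;
}
```

A loop that follows a chain of user-supplied links would call user_range_ok on every link before touching it, not just on the first one. Checking only the starting address is not enough either: a structure that begins in user space can still extend into the reserved range.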

The Intel 386 had a closely related design flaw in its memory map.

There is yet another possible vulnerability as the kernel accesses data from the user’s memory. Kernel code that examines some data byte in memory via more than one fetch from memory may assume that each fetch finds the same value. Since that is normally assured for benign environments this is a convenient and likely pattern to be found in kernels. This reports an exploit of this vulnerability. In that exploit the time separation of the two fetches was significant and fetches closer in time would be harder to exploit. Yet the perils of not copying data across protection domains seem real.
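One way to avoid depending on two fetches agreeing is to copy the user data once and then validate and use only the copy. The following C sketch illustrates that pattern. The request layout and the name handle_request are hypothetical, and a real kernel would also range-check the user address and use its own fault-tolerant copy routine rather than plain memcpy.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PAYLOAD 64

/* Hypothetical request layout that a user program leaves in its own memory. */
struct request {
    size_t len;                 /* claimed payload length */
    char payload[MAX_PAYLOAD];
};

/* 'user_req' stands in for user memory that another thread might rewrite
 * at any moment. Returns the number of bytes copied to 'out', or -1 if the
 * claimed length is unacceptable. */
static long handle_request(const struct request *user_req,
                           char *out, size_t out_size)
{
    struct request local;

    /* Single pass over user memory: take a private snapshot. */
    memcpy(&local, user_req, sizeof local);

    /* Validate the snapshot, never the original. */
    if (local.len > MAX_PAYLOAD || local.len > out_size)
        return -1;

    /* Use the same snapshot that was validated. */
    memcpy(out, local.payload, local.len);
    return (long)local.len;
}
```

Because the length that is checked and the length that is used are the same private value, a concurrent change to the user's copy after the snapshot cannot make them disagree.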

An Alternative Scheme

Many systems can execute privileged code with a map different from the user, or no map at all. If no addresses are to be reserved then the system must switch maps simultaneously with switching to privileged mode. Some systems are able to turn off the map or switch to another upon a trap or interrupt. The IBM 370 enables mapping independently for privileged and unprivileged modes. The PowerPC reverts to unmapped upon a trap. Such hardware should provide some facility to ease access by privileged code to user space.

The 370 provides the Load Real Address instruction that turns a user provided address into one that unmapped code can use to access user data. When operands cross page boundaries, the LRA hack is painful.
The PowerPC has privileged commands to turn the map on or off. The map for instruction fetching remains off.
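To show why operands that cross page boundaries are painful for unmapped code, here is a toy C sketch. It does not model the 370 or the LRA instruction itself; the page table, the real_mem array, and the function names are all assumptions made for illustration. The point is only that each page must be translated separately, because pages that are adjacent in the user's virtual space need not be adjacent in real memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define NPAGES 4u

/* Toy stand-ins: 'real' memory and a one-level page table mapping a
 * virtual page number to a real frame number, or -1 if unmapped. */
static unsigned char real_mem[NPAGES * PAGE_SIZE];
static int page_table[NPAGES] = { 2, 0, -1, 1 };

/* Toy analogue of translating one user address; NULL means a fault. */
static unsigned char *translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr / PAGE_SIZE;
    if (vpn >= NPAGES || page_table[vpn] < 0)
        return NULL;
    return real_mem + (size_t)page_table[vpn] * PAGE_SIZE + vaddr % PAGE_SIZE;
}

/* Copy 'len' bytes starting at user virtual address 'vaddr' into 'dst',
 * translating again at every page boundary. Returns 0, or -1 on a fault. */
static int copy_from_user_pages(void *dst, uint32_t vaddr, size_t len)
{
    unsigned char *out = dst;
    while (len > 0) {
        unsigned char *src = translate(vaddr);
        if (src == NULL)
            return -1;
        size_t chunk = PAGE_SIZE - vaddr % PAGE_SIZE; /* bytes left in page */
        if (chunk > len)
            chunk = len;
        memcpy(out, src, chunk);
        out += chunk;
        vaddr += (uint32_t)chunk;
        len -= chunk;
    }
    return 0;
}
```

In this toy table virtual pages 0 and 1 map to frames 2 and 0, so a four byte operand at virtual address 4094 is read from two widely separated places in real memory.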

In both of these schemes special code, unlikely to be produced by compilers, is necessary for privileged code to access user space. We have seen, however, that treating such access as an ordinary memory reference is likely to be insecure. In both the PowerPC and 370 schemes, bugs in access to user space will be caught while testing enough correct user code. It would seem that testing kernel releases against hostile user code is not done often enough.

Linus Torvalds addresses these problems and others with this scheme. Unfortunately it is not part of the normal debug cycle.

A Kludge

It might be argued that machines that do not change the map upon interrupt can instead merely change it shortly thereafter. However, those first few privileged instructions after the trap must already be fetched from some address, and those addresses must be reserved.

One plan that seems feasible on some such machines is to reserve different addresses at different instants and thus present the illusion to the user code that no addresses are reserved. Perhaps there is just one page of privileged code that is mapped into user space (without user access) in order that that code switch to another map private to the kernel. This physical page can be mapped at different addresses at different times. The page contains address free code. The address of that page within a user’s space would indeed be reserved but when the user code uses those addresses the virtual address of that page is changed. The IA32 segment registers can be used to support this trick efficiently.

The Virtual Machine

One motivation, aside from elegance, not to reserve addresses is in the construction of virtual machines. A program designed to run in privileged mode is unlikely to conform easily to pre-existing reservations.

How Linux copes

The standard Linux system calls each take a few arguments some of which may be addresses of data structures in the caller’s memory which the kernel must read or write. (The list can be produced by the shell command “man 2 intro”. See Linux sources.) In each case extra kernel code is required to verify that the entire structure is within the user’s space, lest the kernel inadvertently access data by abusing its extra authority over that of the caller.

Almost every system call takes an address of some string or structure in the user’s memory and must either read or modify that location as part of its function.

See this on space checks. Linus explains all. As far as user addresses go I think it would have been OK to type them as long ints. That way it wouldn’t be easy to dereference them. C (and many other languages) need distinct types that are all storage equivalent. For instance you could declare floating point inches or floating point centimeters and the compiler would notice unit flaws. Ada and Euclid had this.
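The idea of storage-equivalent but distinct types can be approximated in C by wrapping a value in a one-member struct. The sketch below is only an illustration: the names user_addr, fetch_user, inches, centimeters, and to_cm are hypothetical, and user_mem is a toy array standing in for the user's address space.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A user-space address wrapped in a one-member struct. It holds the same
 * bits as a uintptr_t, but the compiler rejects '*p' and 'p->field'. */
typedef struct { uintptr_t raw; } user_addr;

#define USER_SIZE 256u
static unsigned char user_mem[USER_SIZE];   /* toy user address space */

/* The only way to read through a user_addr: a checked copy.
 * Returns 0 on success, -1 if the range leaves the toy user space. */
static int fetch_user(void *dst, user_addr src, size_t len)
{
    if (src.raw > USER_SIZE || len > USER_SIZE - src.raw)
        return -1;
    memcpy(dst, user_mem + src.raw, len);
    return 0;
}

/* Unit types in the same spirit: passing centimeters where inches are
 * expected is a compile-time error, although both are just a double. */
typedef struct { double v; } inches;
typedef struct { double v; } centimeters;

static centimeters to_cm(inches in)
{
    centimeters cm = { in.v * 2.54 };
    return cm;
}
```

With such a type, kernel code that receives a user_addr cannot dereference it by accident; it has to go through fetch_user, which is where the space check lives.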

Apple explains these issues for the OS X kernel.

Just now (2006 May 10) a Slashdot article says that Torvalds has written a note on the micro kernel issue and reportedly says that these issues are critical to it. Some of the comments are interesting. I agree with those things that he says and that I understand. I don’t know what he means by ‘micro kernel’ and so I cannot understand his objections to them. It depends somewhat on hardware architecture.

A good peek into Linux kernel logic
KAISER is a patch to several x86 kernels to remove almost all kernel code from user space. What remains is constant and holds no secrets. KAISER thwarts Meltdown.