Section 4.2 of Intel’s “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A & 3B): System Programming Guide” describes the hardware’s interpretation, in 64 bit (IA-32e) mode, of the 64 bit ‘linear address’ produced by program logic. Implementations are presumed to implement k bits of a linear address and then demand that the high 65−k bits all be the same. A 64 bit word satisfying this is called ‘canonical’. Non-canonical numbers cause faults when used as addresses. The hardware assumes that the software has created a tree of mapping tables. A mapping table is a page (4KB) in size. Each entry in a table is 8 bytes long and provides the address of either a page or another mapping table. The privileged register CR3 holds the physical address of the level 0 mapping table, which is the root of the mapping table tree. An entry of a mapping table at level n locates a mapping table of level n+1. There are roundup((k−12)/9) levels of tables. The left k−12 bits of the address are divided into 9 bit portions, with perhaps the leftmost portion less than 9 bits. The portions, from left to right, are used to index into the successive levels of the mapping tree. According to bits found in a mapping table entry, the entry may locate the next level of mapping table, locate a page, or be marked not present, in which case a reference through it faults.
Here is some C code to time storage allocation together with (lazy) zeroing. Allocating and zeroing 2^46 bytes (64 terabytes) goes as follows:
gcc a.c -Wmost tbx86.s -fnested-functions
time ./a.out 46

P=70368744177664
 329992749 malloc:
     40599 0 test:
     32166 0 test:
     29817 0 test:
     28152 0 test:
     29889 0 test:
        54 0 test rep addr:
        63 store in area:
2750230224 free:
 278035515 malloc:
     69075 0 test:
     32598 0 test:
     29520 0 test:
     32643 0 test:
     31554 0 test:
        45 0 test rep addr:
        72 store in area:
2759717952 free:
 282390579 malloc:
     80118 0 test:
     34911 0 test:
     31896 0 test:
     33075 0 test:
     24642 0 test:
        63 0 test rep addr:
        72 store in area:
2768732793 free:
Total clocks = 9169660836

real	0m3.842s
user	0m0.001s
sys	0m3.840s
The units are processor clock cycles as counted by Intel’s ‘Time Stamp Counter’. It is hard to see how this might cause disk IO; at least some full executions of the program leave both disk read and disk write counts, as reported by the ‘Activity Monitor’, unmodified. The total cycle count is about 9.2 giga clocks, which is about right for 3.8 seconds on the 2.4 GHz CPU. The program can allocate 2^46 but not 2^47 bytes. That is about right for a CPU that implements 48 bit virtual addresses. I have not found an authoritative source of such information for the Core 2 Duo in my Mac.

10 giga clocks (about 2^33) for ‘zeroing’ 64 tera bytes would be impressive if the zeroing were not lazy, and thus almost entirely avoided. 2^46 bytes is 2^34 pages requiring 2^37 bytes of page tables. My Mac has only 2^33 bytes of real memory and so the page tables must be created lazily too, as Keykos does. It is plausible that the kernel builds all 2^28 bytes of 2nd level page tables. I think that it does not, because the Activity Monitor shows no memory usage blip. It is curious that about 2.75 giga clocks are spent freeing the memory—nearly 10 times that required to allocate it. Both the malloc and free times are proportional to the size of allocated storage, down to a tiny 100 GB of allocated storage. I suspect that there is an avoidable kernel loop considering some level in the hierarchy.

In Keykos the segment keeper would need to traverse a tree with 5 full length branches. Each branch would have 8 nodes plus the root node. That’s about 40 nodes. Each slot of each node must be examined because the keeper has no other record of the shape of the tree. With 16 slots per node that is 640 slot queries, each of which might take about 1000 clocks including the query to see whether it holds a data key. Those 40 nodes would be sold back to the space bank and each sale might require about 2000 clock cycles. That would be about 700,000 clock cycles to sell the segment for scrap. Reclaiming the page tables is a side effect of reclaiming the segment nodes that produced them—already included.

Creating and deleting a segment keeper is something like 10000 clocks. Populating the segment would take one tree branch per fault, which is about the cost of tearing it down. These are probably somewhat optimistic estimates.