“Exponential Permutations” is a strange title but I have seen several applications where it was necessary to permute an array of 2n elements where each element at index i was to move to a new index j whose bits were a fixed permutation of the bits of i. Thus a permutation on n index bits was to define a permutation of 2n array elements. Thus the name “Exponential Permutations”.

Just as permutation can always be decomposed into elementary permutations (by various definitions of “elementary”), the corresponding exponential permutations can likewise be decomposed. Such a pattern may not always lead to the most efficient plan for the large permutation, but it may lead to a general and easy to understand program. Many exponential permutations corresponding to elementary permutations are much more cache efficient than the general direct exponential permutations and thus the decomposed version may be faster even when more instructions are issued.

Transposing bits on Harvest

I first recall this pattern in a program for the Harvest computer. The task was to transpose an array of 210 by 210 bits. (Addressing on Harvest was to the bit.) A 20 bit integer served as index into the array. The elementary permutation we chose rotated the 20 bits of the array index left by one, and repeated that permutation 10 times. This swapped the left 10 bits of the index with the right 10 bits, thus performing the desired transposition. Each rotation of the index by one was accomplished by passing over the array of 220 bits. This transformation on the 220 bits was fast for the Harvest because the Harvest had special table driven stream processing which could be programmed to merge two bit streams into one with alternate bits from the two inputs.
for(j=0; j<(1<<16); ++j) b[j] = table[c[j]][c[j+(1<<16)]];
could be performed without the instruction counter. b and table are arrays of 16 bit values and c is a previous generation of the data viewed as an array of 8 bit bytes. This constituted the perfect shuffle. After 10 identical passes the permutation was done.

Harvest was big endian and also numbered and addressed bits from the left. The basic transformation took 8 bit bytes b[j] and b[j+65536]; and, labeling their respective bits abcdefgh and ijklmnop, produced aibjckdlemfngohp by a table lookup using those two input bytes as table indexes. This 16 bit value was placed in the output array at a[2*j]. This was iterated for 0≤j<65536. The above “round” did not require the program counter and was repeated 10 times. Bit 0 of byte 0 stayed put for each of the 10 rounds. Quoting bit addresses in hex Bit 1 of the array moved on successive rounds to hex bit addresses 2, 4, 8, 10, 20, 40, 80, 100, 200, 400. Bit 8 moved as follows 10, 20, 40, 80, 100, 200, 400, 800, 1000. Meanwhile the bit starting at 1000 progressed to 2000, 4000, 8000, 10000, 20000, 40000, 80000, 1, 2, 4, 8. Reviewing the array as a square bit array, bit b[0,0] stayed put, b[0,1] went to b[1,0] and b[0,8] exchanged with b[8,0]. The array was thus transposed. The hardware accessed the variable data by 64 bit words and the table lookups from a small fast memory by 16 bit hunks.

I am not sure that I would have been able to think thru this if bits were numbered in a byte from right to left and bytes in a string from left to right. My mental model was that of an array of bits and the hardware was documented and programmed this way.

While each bit traveled between core memory and the stream processor 20 times, it traveled each time with 63 other bits that were each also on a profitable trip. The naïve programs would have required 220 fetches or stores of isolated bits. The optimized program required 10*2*220/64 fetches and stores. Alan Karp makes and extends these performance observations in his note: Bit Reversal on Uniprocessors.

Image transposing in SmallTalk

Peter Deutsch selected a different elementary operation to transpose a square array of pixels. His elementary permutations swapped the bits of the array index, left and right, one at a time. The intermediate images were amusing. The nature of BitBlt made this efficient for the SmallTalk world.

Yet another instance of this pattern is the index bit reversal portion of the Fast Fourier Transform. The required permutation of array indexes, is to reverse the order of the bits. I suspect that the fastest way to do this does not fit in the exponential pattern. The exponential version, with caches, may be more efficient than the fancy schemes that require fewer instructions, and it is easier to understand.

Connections with other fields

These are specialized permutation algorithms. Sorts permute data too but the permutation is data dependent. Data independent permutation algorithms can be derived from sort algorithms by imagining a key field accompanying each item. Transposing the bits on the Harvest moved the data in the same pattern as von Neumann’s binary merge version of sort. It achieves the same cache efficiency as the sort did with magnetic tapes. Alan Karp tells me of a algorithm like the Harvest scheme but in the exact opposite order: taking 01234567 to 02461357. This would come from the binary pocket sort or radix sort.

The Harvest could do pocket sorts too but less efficiently. It could maintain just one efficient output stream but two efficient input streams.

There are “block algorithms” for manipulating matrices that pass over the matrices by sub-blocks instead of the naïve row and column order. This is a related idea which achieves related memory bandwidth benefits.

Data base query optimization resorts to such patterns.

Related Idea

Suppose we have transmitted data between machines with different endianness. We want to reverse the order of 2n bytes. One can reduce this to reversing the eight bytes of a 64 bit operand.
typedef long unsigned int L;
…
L rb(L x){L a = x<<32 | x>>32,
   b = ((a&0xffff0000ffff0000L)>>16) | 0xffff0000ffff0000L&(a<<16);
   return ((b&0xff00ff00ff00ff00L)>>8) | 0xff00ff00ff00ff00L&(b<<8);}
This can clearly be extended to nibble or bit reversal. Here the address bits are complemented, not moved.