There is the real code but here are a few comments. The C code for the transform is merely the 10 line routine “fft”. At the end is a main routine to illustrate invocation of fft and illustrating rounding errors.

You may not like my formatting but the algorithm is obscure with any formatting.

fft provides about 8 bits less precision than provided by the type named “c”. This is due to the way that w and b are calculated. A more complicated way of computing b and w can improve their accuracy or alternatively the computing platform might provide a type for b and w with 12 more bits of precision than needed for the entire calculation. If the type identifier “cx” in this version is bound to that type, the problem is solved. This affects the computation of b, “b = csqrtl(b);”, and w, “w *= b;” even if it is slow. Most critically w needs to carry 9 or 10 more bits of precision than the input to fft so that the 512 iterations of “w *= b;” do not corrupt the calculation. The 10 iterations of “b = csqrtl(b);” could be replaced by a table lookup for any practical application. Again each successive value of b needs one more bit than the previous and the first needs the precision of the input.

In 2006 Nov the Mac OS X csqrtl produced only 52 bits of precision. In 2007 July, it gives 64 bits as it should! Here is a complex square root if you should need it.

Here is experimental code for bit reversal.

Here is some logic for the code. The Fast Fourier Transform is performed by the 10 lines of code comprising the fft routine. The third line conditionally swaps the input array elements at indexes i and j. The 2nd and 4th line arrange to provide all combinations of i and j such that i and j are bit reversals of each other. The three lines thus perform the infamous butterfly permutation in place. Unless the entire array fits comfortably in cache this operation is not cache friendly. Further the 4th line is an inefficient way to compute successive backward counts.

This is a collection of ideas for doing such array permutations more efficiently and in a cache friendly way. Some cache savvy routines do more data movements but are much faster. Using the ideas described there you choose k such that 2k elements of the array occupy no more than ½ of the cache. ...

I should study this.