An inner loop that was once the purpose of a collection of many multi-million dollar computers is as follows
{int n=9000; while(n--) ++hist[d[n]];}
A slightly more general loop is
{int n=9000; while(n--) hist[d[n].a] ∴= d[n].v;}
where ‘∴’ is some associative operator such as ‘+’, ‘−’, ‘|’, ‘&’, etc. (Here I mean ((x ∴ y) ∴ z) = ((x ∴ z) ∴ y) by ‘associative’.) Another variation is the following routine:
void f(int k){int n=9000; while(n--) ++hist[d[n] ∴ k];}

Programming this for a RISC machine is fraught with dilemmas. The logic that a correct compiler will stumble over is the alias problem in computing the address of ‘hist[d[n]]’. It can not move loads and stores due to the possibility of an alias. Neither can the programmer generally assure the compiler that there will be no aliases. Sometimes the action lost due to an alias is unimportant; other times it is vital. This means that loads cannot be hoisted above stores and the program runs much slower than it seems it ought.

Here is an extension to an architectural hack that is like a rumored feature in the Transmeta hardware whose purpose was to emulate x86, and not otherwise solve application problems. The feature is an instruction that traps when two distinct registers, specified by the instruction, hold the same value. If trap overhead is sufficiently low then the trap routine can fix things and continue. If overhead is too bad for this then hardware can, upon equality, instead of trapping, emit an element into a stream of index values to be processed at the end of the loop.

The stream would have an element for each value of ‘d[n]’ where the pipelined instruction stream lost the effect of an update due to a alias. If the stream were put into an array like d, then the loop could be repeated for this array—recursively. This will converge rapidly.

I emphasize that this is neither an emulation problem nor a language problem; it is an application problem. The application that I am familiar with was an assembler program.

A possible trade off is for the hardware to collect a short stream, say 4 elements, and then trap. This amortizes the trap cost over more alias events. I do not expect the compiler to notice this all by itself and to compile the above code using this hack. This is a hack that lets the hardware do the application efficiently. I don’t have a proposed way just now to express this in C or any other language.

Some machines may want to have more than two values of hist in play at once. If you have n index values in play at once then n−1 alias compares are necessary. I have not worked out the details.

Harvest’s ‘memory distribution mode’ addressed a somewhat similar problem with a much different solution.

Doing without platform support

The following C code will probably produce faster code for RISC:

{int r1=3000; int* p = d; while(r1--){int *
  r2 = *p,
  r3 = *(p+1),
  r4 = *(p+2);
  if(r2==r3 || r2==r4 || r3==r4) {++*r2; ++*r3; ++*r4;} else {
    int r5 = *r2, r6 = *r3, r7 = *r4;
    ++r5; ++r6; ++r7;
    *r2 = r5; *r3 = r6; *r4 = r7;}
  p += 3;}}
I assume that the y in if(p) x; else y; is inline. The hack may not be worth it.

Better (or Simpler) Yet

Here is the simplest version yet of this idea. I should have started here.

{int r1=3000; int* p = d; while(r1--){
  int * r2 = *p, r3 = *(p+1);
  // r2 & r3 hold addresses of 2 cells to be incremented.
  // Perhaps r2=r3.
  *p = *r2+1;
  *(p+1) = *r3 + (r2==r3?2:1);
  p += 2;}}