I take this paper as the official introduction to Google’s MapReduce system. Below I mean by bag an unordered ‘set’ of values, but allowing repeats of values.

They describe the function map as a function provided by the user that takes a key-value pair and returns a bag of key-values. The role of the ‘argument’ parameter to map is unclear to me. Is the map routine required to pay attention to its value? Perhaps the unstated assumption is that the ultimate purpose is to compute something that depends on a ‘bag’ of a great many pre-existing key-value pairs. I cannot imagine how the type of the bag elements bears on the logic of MapReduce. Perhaps the key of each output pair of map must be the same as the key of its argument.

The map function in their first sample code in section 2.1 indeed ignores its key argument. So far the paper has not described how to install these two functions into the hardware which must somehow determine the arguments to map. The sample implies that MapReduce does decimal arithmetic on the intermediate values. That seems bizarre.

Ah! We come to the mapreduce specification object where we learn that MapReduce assumes that map is taking its input from files.

I have seen no use of the key argument to map in any of the tutorials.


I think that the magic is to distribute code so as to near the data, which today is most likely on many disks each attached directly to computers that can execute such code. Then route the results as intermediate keyed packets via topology savvy mechanisms to places where packets with the same key come together for subsequent application processing. A final typically insignificant stage collects the yields of this subsequent processing. That it can transparently hide hardware failures is an extra strategic advantage.
MapReduce is a peculiar sort of magic. It cannot be adequately described as a normal software abstraction device. It cuts across abstractions in an unaccustomed manner. That is why it can be so effective. An implementation that gains some of the possible advantages would be deep in bed with packet routing decisions where most of the performance advantages will arise for some applications. I suspect that MapReduce optimizes data transmission within the data center for most applications where the cost of such transmission is an issue. MapReduce is a framework in which to efficiently map a significant number of applications on to the massive computer hardware that we know how to build today.