Reading Wayne Gramlich’s scaling requirements for storing annotations reminded me of a scheme that we used 20 years ago to record which documents contained which words. Some of these ideas appear in the literature, but another part may be original. Indeed, the idea may be better suited to the annotation-location problem than to the problem for which it was invented. Don Knuth’s “The Art of Computer Programming” (Sorting and Searching, page 562) gives a short description. Perhaps the first published account is by Burton H. Bloom, on page 422 of Vol. 13, No. 7 of the Communications of the ACM, July 1970, where the performance tradeoffs receive careful mathematical treatment. Bloom filters are also well suited to spell checkers.

Abstractly, a Bloom filter is a mutable object that maps strings to booleans: given a string, it returns true or false. False positives are allowed; false negatives are not. A creation-time parameter trades the frequency of false positives against storage cost. Strings can also be added to a Bloom filter, which causes the filter to respond true to those strings from then on. There is no delete operation.

The easiest way to describe the internals of a Bloom filter is to describe how it is read; writing is then obvious. Given a string, its hash is computed. This must be a long, very high quality hash. The hash is divided into n (about 10) pieces, each of m (about 20) bits. The filter’s state is a string of 2^m bits. Each of the n hash pieces is treated as an index into the state string, and the filter responds true if all of the accessed bits are one.
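The reading and writing just described can be sketched directly. Here is a minimal sketch in Python, using SHA-256 as the long high-quality hash and the text’s illustrative parameters (n = 10, m = 20); the class and method names are my own choices, not part of the scheme:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: n hash pieces of m bits each,
    indexing into a state string of 2**m bits."""

    def __init__(self, n=10, m=20):
        self.n, self.m = n, m
        self.state = bytearray(2 ** m // 8)  # 2**m bits, initially all zero

    def _indices(self, string):
        # One long high-quality hash, divided into n pieces of m bits.
        digest = hashlib.sha256(string.encode()).digest()
        assert self.n * self.m <= len(digest) * 8
        bits = int.from_bytes(digest, "big")
        return [(bits >> (i * self.m)) & (2 ** self.m - 1)
                for i in range(self.n)]

    def add(self, string):
        # Writing: set every accessed bit to one.
        for ix in self._indices(string):
            self.state[ix // 8] |= 1 << (ix % 8)

    def __contains__(self, string):
        # Reading: true iff all accessed bits are one (false positives possible).
        return all(self.state[ix // 8] >> (ix % 8) & 1
                   for ix in self._indices(string))
```

Note that `add` can only turn bits on, which is why there is no delete operation: clearing a bit might produce a false negative for some other string.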

Under usual circumstances it is optimal for the density of ones in the state string to be 1/2. The probability of a false positive is then 2^-n. Each new word adds about 0.7*n bits to the database. We will call this number (0.7*n) the load on the filter.
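These figures can be checked with a little arithmetic, assuming the illustrative parameters above (n = 10, m = 20):

```python
import math

n, m = 10, 20                    # the text's illustrative parameters
state_bits = 2 ** m              # size of the filter's state string

# At density 1/2 a false positive needs all n accessed bits to be
# one by accident, so the probability is 2**-n (about 0.1% here).
false_positive = 2.0 ** -n

# The state reaches density 1/2 after roughly 2**m * ln 2 / n words;
# averaged over those words, each accounts for n / (2 ln 2), i.e.
# about 0.7*n, of the ones -- the load described in the text.
capacity = state_bits * math.log(2) / n       # about 73,000 words
load_per_word = n / (2 * math.log(2))         # about 7 bits
```

So a 128 KiB state string holds on the order of 73,000 words at a one-in-a-thousand false-positive rate.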

Neat properties of Bloom Filters

All of the material in this section, and more, is covered in more detail here. Bloom filters have a very remarkable property: like holograms, they are robust in the presence of errors. If the memory (disk or RAM) is equipped with parity and a block of memory is damaged, merely substitute all ones for the damaged block. This slightly increases the frequency of false positives but causes no false negatives, since ones are only ever added to the state, never cleared.
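A short sketch of this robustness property (SHA-256, the parameters, and the 4 KiB damaged-block size are illustrative choices of mine): after overwriting a block of the state with all ones, every string that was added still answers true.

```python
import hashlib

N, M = 10, 20
SIZE = 2 ** M                       # bits in the state string

def indices(s):
    # One SHA-256 hash, divided into N pieces of M bits.
    h = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return [(h >> (i * M)) & (SIZE - 1) for i in range(N)]

state = bytearray(SIZE // 8)
words = ["alpha", "beta", "gamma"]
for w in words:
    for ix in indices(w):
        state[ix // 8] |= 1 << ix % 8

def member(s):
    return all(state[ix // 8] >> ix % 8 & 1 for ix in indices(s))

# Simulate a damaged 4 KiB block: substitute all ones.
state[0:4096] = b"\xff" * 4096

# Every added word still answers true; only false positives increase.
assert all(member(w) for w in words)
```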

Two Bloom filters can be merged by oring their state strings together. The merged filter will respond positively at least whenever either of the inputs would have. The density of the merged filter will be greater than that of the inputs, and thus either the inputs or the output will have a sub-optimal density.
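A sketch of merging by oring, using the same reading scheme as before (SHA-256 and the helper names are my choices):

```python
import hashlib

N, M = 10, 20
SIZE = 2 ** M

def indices(s):
    h = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return [(h >> (i * M)) & (SIZE - 1) for i in range(N)]

def make_filter(words):
    state = bytearray(SIZE // 8)
    for w in words:
        for ix in indices(w):
            state[ix // 8] |= 1 << ix % 8
    return state

def member(state, s):
    return all(state[ix // 8] >> ix % 8 & 1 for ix in indices(s))

a = make_filter(["apple", "pear"])
b = make_filter(["plum"])

# Merge by oring the two state strings together.
merged = bytearray(x | y for x, y in zip(a, b))

# The merged filter answers positively wherever either input did.
for w in ["apple", "pear", "plum"]:
    assert member(merged, w)
```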

Several Bloom filters can be ored together, each offset by a different small number of bits. Hits can then be related back to a particular original filter. If you think this through you will find that such a merged filter is very efficient to read: the bits for all of the original filters lie consecutively and can be fetched together. Of course the merged filter may be constructed as merged from the start.
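Here is a sketch of such an offset merge, assuming 8 original filters; laying filter i at offset i so that a query reads one run of consecutive bits per hash piece is my reading of the scheme, and the parameters are made-up:

```python
import hashlib

N, M = 10, 16          # smaller state for the sketch
SIZE = 2 ** M
P = 8                  # number of original filters

def indices(s):
    h = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return [(h >> (i * M)) & (SIZE - 1) for i in range(N)]

# One merged state: original filter i's bits land at offset i.
state = bytearray((SIZE + P) // 8 + 1)

def add(s, i):
    """Record string s in (offset) filter i."""
    for ix in indices(s):
        pos = ix + i
        state[pos // 8] |= 1 << pos % 8

def query(s):
    """Return a P-bit mask: bit i set iff filter i probably holds s."""
    mask = (1 << P) - 1
    for ix in indices(s):
        # The P bits for all original filters sit consecutively,
        # so each hash piece needs only one small fetch.
        word = int.from_bytes(state[ix // 8: ix // 8 + 2], "little")
        mask &= (word >> (ix % 8)) & ((1 << P) - 1)
    return mask

add("example.com/page", 2)
add("example.com/page", 5)
hits = query("example.com/page")
assert hits & (1 << 2) and hits & (1 << 5)   # filters 2 and 5 report hits
```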

This elegant application relies on the impossibility of retrieving the strings from the filter: one can test membership of a string one already knows, but cannot enumerate the strings that were added.

Application of Filters to Annotation Sets

Suppose that I subscribe to the annotations of p annotators and that each annotator stores his own annotations. I keep an offset filter that returns p-bit strings. When I read a URL x, I hash x and get back a string with a one bit for each annotator who may have something to say about that URL. I then fetch those annotations, or tell an annotation server which annotators to consult.

Here are the cost and performance considerations for this scheme. For each annotator that annotates a URL, there is a load of 0.7*n bits (about 7). Multiple annotations by the same annotator of the same URL do not increase the load. Consulting the offset filter for a new URL requires computing the hash; SHA is a very good hash and costs about 30 instructions per character. Retrieving the p bits requires about another 50 instructions.
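A rough sizing under these figures; the count of annotator-URL pairs below is a made-up example, not from the text:

```python
# Illustrative sizing using the text's figures: n = 10 hash pieces,
# load of 0.7 * n = 7 bits per (annotator, URL) pair.
n = 10
load = 0.7 * n                   # bits added per annotator-URL pair

pairs = 1_000_000                # hypothetical annotated pairs (assumed)
filter_bits = pairs * load       # total load on the offset filter
filter_mib = filter_bits / 8 / 2 ** 20
print(round(filter_mib, 2))      # prints 0.83 -- under a megabyte of state
```

So even a million annotator-URL pairs cost on the order of a megabyte of filter state, and each lookup is a single hash plus a handful of bit fetches.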