A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if “conventional” error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, of which 90% could be hyphenated by following simple rules but all the remaining 50,000 words required expensive disk access to retrieve their specific patterns. With unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk access. But if core memory was insufficient, a smaller hash area could be used to eliminate most of the unnecessary accesses. For example, a hash area only 15% of the error-free size would still eliminate 85% of the disk accesses