Ever since I first read about them, bloom filters were one of those algorithms that seemed really interesting to use. However I’d never found a use for them - until a few months back.
Bloom filters are a space and (often) time efficient data structure. However they’re probablistic; they can be wrong.
So what do they actually do? Sometimes you need to know if you’ve seen a piece of data before. A bloom filter is a data structure to do that check. You can add strings to it and you can ask it if that string has been added.
An example I’ve seen in the past is a url router. Say you have a thousand machines and each knows about 1 million random urls. So you go ask a router, “Which of these thousand of machines should I ask a url?” If each machine can register a bloom filter with the router (a bloom filter that can hold that number of urls would be a few megs), the router can just check that to see who to ask.
There’s a small chance the bloom filter will be wrong and the url server will just reply, “never heard of it.” But it will be far faster than asking all 1,000 servers and will use far less space than a full url index on the router.
But I’ve never needed to write that and no other use case seemed apparent. I’m working on a mail appliance now though, so two use cases appeared - greylisting and rate control. And in the course of researching using bloom filters in this area I discovered I’m not the only one who hit on that idea.
In testing with various data sets I saw rather huge speedups and far lower data storage requirements. In this case it’s acting as a gatekeeper to more expensive database queries.