145

I am reading up on Bloom filters and they just seem silly. Anything you can accomplish with a bloom filter, you could accomplish in less space, more efficiently, using a single hash function rather than multiple, or that's what it seems. Why would you use a bloom filter and how is it useful?

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
headache
  • 8,377
  • 5
  • 18
  • 6
  • 6
    have you read the wikipedia article? It explains the advantages pretty well. http://en.wikipedia.org/wiki/Bloom_filter – Alex Budovski Nov 26 '10 at 03:56
  • @david that seems unlikely, though. k hash functions in a constant space will have many more collisions than a single hash function in a constant space. – headache Nov 26 '10 at 03:57
  • 1
    @Alex I have read the wikipedia article. I understand what is said there, but I don't understand why it is at all better. Why it works is intuitive. Why it is useful is not. – headache Nov 26 '10 at 03:58
  • This writer does a great job with it http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/ – dranxo Mar 02 '14 at 20:54
  • 2
    @dranxo, The linked article http://www.jasondavies.com/bloomfilter/ is better. – Pacerier Feb 09 '15 at 04:33

8 Answers8

212

Alex has explained it pretty well. For those who still did not get quite a grasp on it, hopefully this example will help you understand:

Lets say I work for Google, in the Chrome team, and I want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. So I have a dataset of about 1 million malicious URLs, the size of this file being around 25MB. Since the size is quite big, (big in comparison to the size of the browser itself), I store this data on a remote server.

Case 1 : I use a hash function with a hash table. I decide on an efficient hashing function, and run all the 1 million urls through the hashing function to get hash keys. I then make a hash table (an array), where the hash key would give me the index to place that URL. So now once I have hashed and filled the hashing table, I check its size. I have stored all 1 million URLs in the hash table along with their keys. So the size is at least 25 MB. This hash table, due to its size will be stored on a remote server. When a user comes along and enters a URL in the address bar, I need to check if it malicious. Thus I run the URL through the hash function (the browser itself can do this) and I get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check the if the particular URL in my hash table with that particular key, is the same as what the user has entered. If yes then it is malicious and if no then it is not malicious. Thus every time the user enters a URL, a request to the remote server has to be made to check if it is a malicious URL. This would take a lot of time and thus make my browser slow.

Case 2 : I use a bloom filter. The entire list of 1 million URLs are run through the bloom filter using multiple hash functions and the respective positions are marked as 1, in a huge array of 0s. Let's say we want a false positive rate of 1%, using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01) , we get the size of the bloom filter required as only 1.13 MB. This small size is expected as, even though the size of the array is huge, we are only storing 1s or 0s and not the URLs as in case of the hash table.This array can be treated as a bit array. That is, since we have only two values 1 and 0, we can set individual bits instead of bytes. This would reduce the space taken by 8 times. This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself !! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in the browser itself), and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence saved time. A value of 1 tells us that the URL MIGHT be in the list of malicious URLs. In these cases we make a call to the remote server and over there we can use some other hash function with some hash table as in the first case to retrieve and check if the URL is actually present. Since most of the times, a URL is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only in some cases, if the bloom filter tells us that the URL MIGHT be malicious, only in those cases we make a call to the server. That 'MIGHT' is 99% right.

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every URL entered.

We can see that hash table with a single hash function is used for a different purpose altogether than a bloom filter. Hopefully this clears your doubts :)

edit:

I have implemented a bloom filter for the task of malicious URL testing in Python. The code can be found here - https://github.com/tarunsharma1/Bloom-Filter The code is very simple to understand and a detailed description is provided in the readme file.

Community
  • 1
  • 1
Tarun
  • 2,579
  • 1
  • 13
  • 7
  • 8
    Thanks for a use case scenario. – Squiggs. Jun 30 '15 at 18:45
  • 1
    I didn't get the part of hashing and associating a value of 0 or 1. If we are using an array, and store 0 and 1 in those, how do we look up for hash value of a url when we are performing the test? – divinedragon Aug 12 '15 at 03:35
  • 2
    So basically we use something called a hash function..which takes in the URL as a string..and gives out a number..we use this number and set the corresponding array index value to 1. A number of different hashing functions exist, but what is important is that every time the same URL is passed through a hashing function, it has to generate the same number. An example of a hashing function could be to add up the ascii values of all the characters in a URL. In bloom filters we use many hashing functions and set all those array index values to 1. Hope this cleared your doubt. – Tarun Aug 16 '15 at 05:37
  • 1
    A conventional hashtable such as C# `HashSet` will use 16 bytes per element element in the best-case scenario in which the hashtable is completely full: 4 bytes map from a "bucket" to an entry in an entries table (an array-packed singly-linked list), 4 bytes for the cached hashcode, 4 bytes for the "next" pointer, 4 bytes for a pointer to the key. And that's not counting the string sizes. In the worst case it's 40 bytes: half the entries are unused and 20 bytes per entry once the `String` pointer expands to 8 bytes for 64-bit architectures. – Qwertie Oct 31 '17 at 17:05
  • 1
    You don't have to save the String itself in the hash set. You can save the hash of it as the value, making the hashset much smaller. Then you can play with the hash size - the bigger it is, the smaller false positive rate will be. – user1028741 May 29 '19 at 13:41
  • 1
    25MB is big to load into the browser? That seems like such an overkill to implement a bloom filter – Michael P Feb 09 '23 at 19:28
178

From Wikipedia:

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

Pretty clear to me.

A bloom filter doesn't store the elements themselves, this is the crucial point. You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. This lets you not do extra work for elements that don't exist in a set (such as disk IO to look them up).

And all in significantly less space than something like a hash table (which is likely going to be partially on disk for large data sets). Though you may use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.

So an example usage pattern might be:

You've got a lot of data, on disk -- you decide on what error bound you want (e.g. 1%), that prescribes the value of m. Then the optimal k is determined (from the formula given in the article). You populate your filter from this disk-bound data once.

Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done. No disk reads, etc. (Which you would have to do if it were a hash or tree, etc).

Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.

Alex Budovski
  • 17,947
  • 6
  • 53
  • 58
  • 3
    If it's clear, please answer. How could this possibly be more space efficient than a single hash function on the same sized set? This will simply create more collisions. You will bounce around searching on separate hash functions to ensure you have a 1 at all of the hash functions. I don't understand it's advantage over using a single hash function. – headache Nov 26 '10 at 04:02
  • 23
    A hash function is code, not data. What do you intend to use the hash function with? A hash table? In that case, your table will need to store keys, which could be of arbitrary size, unlike a bloom filter. The excerpt mentions this. – Alex Budovski Nov 26 '10 at 04:04
  • 3
    Consider a bloom filter with only one hash function, rather than k. What is the advantage to adding more hash functions? This will simply create more collisions. Or am I wrong? – headache Nov 26 '10 at 04:07
  • 2
    That is answered by the last paragraph in "Space and time advantages" in the Wikipedia article, and the "Probability of false positives" section. – Alex Budovski Nov 26 '10 at 04:12
  • 1
    @headache: but you generally won't get collisions across all the hash functions. – Michael Burr Nov 26 '10 at 04:13
  • 4
    It just clicked. Thank you so much, this has bugged me for a while. It decreases the number of false positives because a false positive would need to either a) be a collision on all of your hash functions or b) all of the spaces have been filled by other values. Choosing the size must be a tricky process then, I guess. Correct me if I'm wrong, but I think I get it. Thanks everybody. – headache Nov 26 '10 at 04:20
  • @headache, "Choosing the size must be a tricky process then, I guess" -- not really, the optimal formula is already given in the Wikipedia article for both *k* and *m* in terms of *n*. – Alex Budovski Nov 26 '10 at 04:58
  • @AlexBudovski, In your 2nd-last paragraph, you stated that we "would have to do [disk-reads] if it were a hash". However, why do we need a diskread if `hash(item)` doesn't even match any key in the hashtable? – Pacerier Aug 14 '14 at 15:22
  • @Pacerier I'm going to take a stab. If your hash table is stored on disk (like, say, a k/v database that uses mmap) you will need to hit disk in order to even see if the key exists at all. With a bloom filter, you load a smaller set of information about your elements into memory and do the look up with that. Am I correct, Alex? – Kyle Aug 18 '14 at 02:39
  • @Kyle, If we had a hash table with 65536 buckets, we only need 65536 bits—(that's a mere 64 kb of ram, how could a bloom filter beat that?)—to keep track of all the buckets' exist-or-not state. It's as simple as `boolean bucket_exist = state[hash(item, 0, 65536)]` which requires exactly zero disk operations. – Pacerier Aug 18 '14 at 04:57
  • One of the best explanations of why need a bloom filter. Great answer! – seeker Sep 12 '14 at 05:08
  • @Pacerier That's one collision for every 65536 values - which is pretty non-trivial error. It's correct though that the values don't need to be stored in the hash-table, just a boolean field is fine. – user3467349 Feb 06 '15 at 06:34
  • @user3467349, An error of 1/65536 is **655.36 times less** than the 1% error rate: What do you mean by "*pretty non-trivial*"? – Pacerier May 06 '17 at 13:30
  • 1
    @headache, AlexBudovski. Re "*You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present*"... Note that this is exactly the same as a Hash table. So really, I have never understood this: **How is a bloom-filter "table" any different from a multi-hash "table"?** Aren't they exactly the same thing? – Pacerier May 06 '17 at 13:31
31

I will start with the explanation of what is a bloom filter, what it can and can't do, why do we need it, show an intuitive description how it works and then give some example when they can be useful.

So a standard bloom filter is a probabilistic data structure that can*:


  • add element to a set
  • check if an element is in the set by telling definitely not in the set or possibly in the set

This possibly in the set is exactly why it is called probabilistic. Using smart words it means that false positive are possible (there can be cases where it falsely thinks that the element is positive) but false negative are impossible.

But it can't *:

  • remove an item from the set
  • give you a list of all elements that are currently in your set

*This set of can/can't is for a basic bloom filter. Because it is a useful data structure that was created long time ago, people found how to augment it with other useful features.


But wait a minute: we already know a data structure that can answer all of this without vague 'possible' and also without all of the limitations (can't remove, can't show all). And it is called a set. And here comes a main advantage of a bloom filter: it is space efficient and space constant.

This means that it does not matter how many elements do we store there, the space will be the same. Yes a bloom filter with 10^6 elements (useless bloom filter) will take the same amount of space as a bloom filter with 10^20 elements and the same space as bloom filter with 0 elements. So how much space will it take? It is up to you to decide (but there is a trade of: the more elements you have the more uncertain you are with you possible in the set answer.

Another cool thing is that it is space constant. When you save the data to a set, you have to actually save this data. So if you store this long string in the set you have to at least use 27 bytes of space. But for a 1% error and an optimal value of k **, you will need ~ 9.6 bits ( < 2 bytes) per any element (whether it is a short int or a huge wall of text).

Another property is that all the operations are taking constant time, which is absolutely not the same as amortized constant time in case of sets (remember that if the set has collisions, it can deteriorate in O(n) time).

**k is a value of hash functions used in the bloom filter


I will not describe how the bloom filters work (wikipedia article does a very good job explaining everything). Here I will just briefly tell the basics.

  • you initiate an empty bit array of length m
  • you select k different hash functions (the more independent the better)
  • if you want to add element, you calculate all the k hashes of this value and set the corresponding bits to 1
  • if you want to check if element exist, you also calculate all k hashes and if at least one of them is not set, it is surely is not in the set. Otherwise it can be in the set.

Even this description is enough to understand why we can't be sure (you can get all bits set from various other values). Here is a very nice visualization of how it works.

enter image description here


So when can bloom filters be useful? The short answer is everywhere where false positive are acceptable and where you would want to check if something is in the set, but even if they are not, it can be a first line of defense to rule out expensive calls to verifiers.

Here is a list of more concrete descriptions:

  • a standard example of malicious websites and a browser is described in almost any place where people talk about bloom filters
  • is a passwords weak: instead of having a huge set of all possible weak passwords, you can just check whether the password is surely not weak with a way smaller bloom filter
  • if you have a list of articles and a list of users, you can use bloom filter to show users' articles they have not read. Interesting thing is that you can have only one filter (you check whether the combination of user_id + article_id is there)
  • bitcoin uses bloom filter for wallet synchronization
  • Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from examples in bloom's filter article at wiki)
Community
  • 1
  • 1
Salvador Dali
  • 214,103
  • 147
  • 703
  • 753
14

Bloom filters are quite useful in bioinformatics. They can be more space efficient compared to using a regular hash, especially when the size of the strings you are working with can be hundreds of millions of letters with a very small alphabet ie {A,G,T,C} . They are usually used to assess whether a certain k-mer is present or absence in a genome. There's an example of one used for something relevant here.

EDIT:

The multiple hash functions are used to minimize false positives. The hope is that between all of the k-hash functions each value will have a unique signature in the bit-array compared to every other possible value. However, false positives do exist, but they can be minimized to a manageable level. Using this technique you hash elements independently of their size. When you search for them, you use each hash function and check to make sure their bit-values are all 1.

Compare this to the human genome, where an increase in the size of the element increases the size of the hash table significantly (The table size is 4*4k). This is assuming you encode the the elements using 2 bits / letter.

GWW
  • 43,129
  • 11
  • 115
  • 108
  • 2
    Sorry, maybe I am misunderstanding but how can they be more space efficient compared to a regular hash? The hash of a string is fixed length output, and you simply set that value to 0 or 1. This is also what bloom filters would do, but bloom filters would do it on *multiple* hash functions. Where am I misunderstanding? – headache Nov 26 '10 at 03:50
  • 1
    It's not much use to just store a single hash. Then it would have no way of handling hash collisions. Most hash table implementations have a way of dealing with this that incurs overhead. Python dictionaries for example store the key alongside the hash and start linearly probing upon collision. The bloom filter cuts that out and tries to minimize the damage inherent to doing so by using multiple hashes. – Bret Fontecchio Jun 27 '14 at 17:00
  • 2
    Why not create a bloom filter but with just one hash function? maybe "relatively large" hash function. But one instead of many – Giorgi Moniava Aug 08 '16 at 13:00
  • @BretFontecchio Ah! I think this is the explanation that did it for me! I've never noticed that the keys in a hash table are ONLY there for collisions/resizing. So it's kind of like: bloom filters and hash tables are solving the same problem but with different approaches to collision-handling. – Mike B Oct 05 '22 at 04:00
8

If a Bloom filter returns that an item is member of the set, there's a certain probability for a false positive. If only a single hash function were used to indicate membership in the set, the probability of a false positive would be higher than using multiple hash functions.

Michael Burr
  • 333,147
  • 50
  • 533
  • 760
  • 1
    Need some serious elaboration on the meat of the answer: "*the probability of a false positive would be higher than using multiple hash functions*"... – Pacerier May 06 '17 at 14:13
1

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set or not. It is called probabilistic because while a negative answer is 100% accurate, there might be false positives means that if the answer is "Yes, this element presents in the data set` it is not 100% accurate.

Why Bloom filter is Memory Efficient:

Basic Bloom filters don’t store data; they just answer the question whether an element is a member of a set or not. To implement a bloom filter we need bits array or think it as a binary number initially with zeroes. (I think the size should be 2x of the size of the data)enter image description here

Then to check if an element exists you apply k (k<lengthOfBits) different hash function. This k is determined when you start to implement bloom filter and you store this k in memory. each hash function outputs an index between 0 and length-1. Convert the result index element to 1.

enter image description here

Eventually, you are converting this binary representation into 10 base number and let sey it is 1000003232232. Now this number is the bloom filter and you are storing this number in memory. What ever byte size that it holds, you are consuming that byte size

Bloom filters are used for caching but not used everywhere. If you know some applications of bloom filters, you will see how helpful they are:

Routers

Modern routers have limited space and, with the volume of packets they process per second, they need extremely fast algorithms. They are thus the perfect recipient for Bloom filters, for all those operations that can cope with a small rate of errors. Besides caching, routers often employ Bloom filters to keep track of forbidden IPs and to maintain statistics that will be used to reveal DoS attacks.

Crawlers

Crawlers are automated software agents scanning a network and looking for content, parsing, and indexing anything they find. When a crawler finds links in a page or document, it is usually programmed to follow them and recursively crawl the link’s destination.

Crawlers ignores links created using tags with an attribute rel="nofollow".

It is actually recommended that you mark in this way any anchor with a link to an action having side effects. Otherwise, search engines’ crawlers, even if they respect this policy, will cause unpredictable behavior.

What can happen is that if you write your own crawler and you are not careful, it might end up in an endless loop between two or more pages with mutual links (or chain of links) to each other. To avoid such loops, crawlers need to keep track of pages they already visited.

Bloom filters are the best way to do so because they can store URLs in a compact way and perform checking and saving of the URLs in constant time.

IO Fetcher

Bloom filter-based caching helps in reducing the unnecessary fetching/storage of expensive IO resources. The mechanism is the same as with crawling: the operation is only performed when we have a “miss,” while “hits” usually trigger a more in-depth comparison (for instance, on a hit, retrieving from disk just the first few lines or the first block of a document, and comparing them).

Spell Checker

Simpler versions of spell checkers used to implement Bloom filters as dictionaries. Today, however, spell checkers mostly take advantage of tries: these data structures provide good performance on text searches without false positives.

Distributed databases and file systems

Cassandra uses Bloom filters for index scans to determine whether an SSTable has data for a particular row. Likewise, Apache HBase uses Bloom filters as an efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. This in turn boosts the overall read speed, by filtering out unnecessary disk reads of HFile blocks that don’t contain a particular row or row-column.

reference

Yilmaz
  • 35,338
  • 10
  • 157
  • 202
1

Anything you can accomplish with a Bloom filter, you could accomplish in less space, more efficiently, using a single hash function rather than multiple

Not true. Using multiple hash functions can give us a lower false positive rate. Specifically, given a fixed space in memory we're willing to use for our Bloom filter in bits (m bits) and a fixed number of elements it will contain (n), the optimal number of hash functions k is k = (m/n) ln 2.

Why isn't it optimal to use a single hash function, as you suggest? The easiest way to see this is to consider a Bloom filter containing a single element. If you only use one hash function, you have a 1/m chance that any distinct element you look up will yield a false positive (due to hashing to the same value as the element truly contained in the filter). But if you have k > 1 hash functions instead, the chance of a false positive is much lower (roughly 1/mᵏ if m >> k) because all k hash values need to collide. More generally, as long as n is small, m is large, and k is small, increasing k will reduce your false positive rate.

Why isn't the optimal number something huge? Because if you have too many hash functions, you'll set all (or nearly all) the bits in your Bloom filter to 1, resulting in a 100% (or close to 100%) false positive rate.

Thus the false-positive-minimising solution is to use multiple hash functions (but not too many).

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
0

Using hash table for 1M elements means finding collision after each 1k insertions because of the Birthday's Paradox. Bloom filter helps to reduce number of collisions significantly, but also can give false positive results.

k06a
  • 17,755
  • 10
  • 70
  • 110