3

I am learning Bloom filter and BitMap(also known as Bit Array) and met a question,can someone give me some instructions on when to use Bloom filter and when to use BitMap?

In my understanding I think that when we need to find the largest number or want to sort the huge data,BitMap is more suitable(for pure digit).

If we want to check some IP address are contained in billions of existed records,then Bloom filter is more suitable(for string or other none pure digit).

However,I want to someone to give me more detailed instructions or suggestions,I have searched on Google and do not find some useful info. Thanks in advance!

Also I do not know if shall I put this question on stackoverflow or other sites,if it's not the right site,hope someone can point it out,thanks!

flyingfox
  • 13,414
  • 3
  • 24
  • 39
  • I don't fully understand the question... what do you mean with BitMap? A Bloom filter internally uses a bit array / bit set / bit map... And I don't think you need a BitMap (whatever that is) to find the largest number, you just use max(x, y). And for sorting, you use radix sort or quicksort or mergesort, or similar... but not a BitMap. So please explain / link to what you mean with BitMap. – Thomas Mueller Apr 16 '19 at 12:58
  • @ThomasMueller I just want to know when to use `BitMap` and when to use `Bloom Filter` – flyingfox Apr 17 '19 at 02:13
  • And how can someone answer, if he doesn't know what you mean with "BitMap"? – Thomas Mueller Apr 17 '19 at 06:51
  • @ThomasMueller `BitMap` might cause misunderstanding,I have updated the tag in my question – flyingfox Apr 17 '19 at 11:15
  • Please refer to the answer for more use case scenarios: https://stackoverflow.com/a/30247022/7750999 – Caffeine Coder Mar 17 '20 at 15:39

1 Answers1

6

When to use a Bloom filter: if you have a set (a list of unique entries) and a hash function, you can create a Bloom filter. It allows queries of type "is entry x likely in the set?". The query will return true for sure if the entry is in the set. For entries not in the set, it may return true, with a low probability, typically 1% or lower (depending on the size of the Bloom filter). You can make the Bloom filter as small as you like, but for a false positive rate of around 1% you need about 10 bits per entry. There are alternative algorithms / data structures that use less space, see e.g. https://github.com/FastFilter. A Bloom filter internally uses a bit array by the way.

When to use a bit array (also called bitset): if the entries are numbers between 0..n, then you can use a bit array as follows: set the bit x for each entry. That will require n bits (no matter how many entries). So if your entries can be large numbers, then there is a problem: it will use a lot of memory. However, you could create a sparse bit array (also called compressed bit array), e.g. using https://roaringbitmap.org/. You won't have false positives as with Bloom filters, but size usage depends a lot on your data (on the numbers you have), much more so than with a Bloom filter.

Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
  • is there an upper bound to roaring bitmap memory usage ? Similarly for checking set membership and constructing bloom or binary fuse or roaring bitmap faster (adding say million elements)? I understand roaring bitmap a lot depends on data distribution but are there any pointers to any benchmarks on typical bits per value and construction , set membership benchmarks . – user179156 Sep 14 '22 at 08:59