2

I am looking around for the best algorithms for the bitset operations like intersection and union, and found a lot of links and similar questions also.

Eg: Similar Question on Stack-Overflow

One thing however, which I am trying to understand is that where bit set stands into this. Eg, Lucene has taken BitSet operations to give a high performing set operations, specially because it can work at a lower level.

However, what looks to me is, the bit-set will start performing slow and slow, as the number of elements increase and the set is sparse, say set has ~10 elements where the max number of elements can be 2 Billion, because that will call out for unnecessary matching. What do you suggest ?

Community
  • 1
  • 1
Love Hasija
  • 2,528
  • 2
  • 27
  • 26

2 Answers2

3

Bit Sets indeed make sense for dense sets, i.e. covering a significant fraction of the domain, as they represent every possible element. The space and running time requirements are O(D) [D = domain size = 2 billion !].

Sorted Set operations represent only the elements in the given set and will have an O(E) behavior [E = number of elements = 10], much more appropriate.

Bit Sets are fast, they are not efficient. I mean their hidden constant is smaller. They are blazingly fast for small sets (say D <= 1024) as they can process 32/64 elements in a single CPU instruction.

  • Thanks Yves, Is there any incremental algorithms than can make use of BitSets for larger domain size. – Love Hasija Jan 23 '14 at 09:33
  • Say, as an example, very random thought, but partitioning the data into chunks of 1024. – Love Hasija Jan 23 '14 at 09:37
  • To decide of an optimization strategy, you need to have more information on the sparsity of the set and the scenarios of union/intersection operations you really need to perform. –  Jan 23 '14 at 09:57
  • Renumbering the elements so that they span the smallest possible range could be an improvement. –  Jan 23 '14 at 09:59
0

For sparse bitsets you can greatly improve performance (and reduce memory usage) using sparse bitmaps where you divide your data into chunks as opposed to storing everything under a single key.

When using bitmaps for analytics, you have a limited number of users active at any given time (e.g. day) and sparse bitmaps use this fact to their advantage.

Shameless plug: http://github.com/bilus/redis-bitops (if you're using Ruby but there are also performance notes there).

Marcin Bilski
  • 572
  • 6
  • 13