7

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but they arrive from the input stream in unpredictable order.

I need to:

  1. Compress this set into ranges to achieve a space-efficient representation. I have already implemented this with a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation the number of resulting ranges is typically within 5,000-10,000, and many of them are single-item, of course.

  2. Test membership of some integer; information about which specific range contains it is not required. This must be very fast -- O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space-inefficient. Other structures, like binary trees, have O(log n) complexity; worse, their implementations make many conditional jumps that the processor cannot predict well, giving poor performance.

Is there any data structure or algorithm specialized in integer ranges to solve this task?
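For reference, the trivial coalescing pass from step 1 might look something like this in C (illustrative names; inclusive bounds assumed):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t lo, hi; } Range;   /* inclusive bounds */

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);                /* avoids overflow of x - y */
}

/* Sort the input, then coalesce duplicates and consecutive values into
   inclusive ranges. `out` must have room for n entries (worst case: no
   two values are consecutive). Returns the number of ranges produced. */
size_t coalesce(uint32_t *vals, size_t n, Range *out) {
    if (n == 0) return 0;
    qsort(vals, n, sizeof *vals, cmp_u32);
    size_t nr = 0;
    out[0].lo = out[0].hi = vals[0];
    for (size_t i = 1; i < n; i++) {
        if (vals[i] == out[nr].hi || vals[i] == out[nr].hi + 1)
            out[nr].hi = vals[i];            /* duplicate or adjacent: extend */
        else {                               /* gap: start a new range */
            nr++;
            out[nr].lo = out[nr].hi = vals[i];
        }
    }
    return nr + 1;
}
```

Since this runs once, the O(n log n) sort dominates and is perfectly acceptable.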

actual
  • Can you be a bit more specific about what operations you need? From what I've read, you have a preexisting set of ranges, and from them you want to support the operation "what range, if any, contains the given integer?" Is this correct? – templatetypedef Jan 05 '11 at 07:36
  • @templatetypedef: I need just yes/no answer to "is this number in set?" for preexisting set. The main question is how to do it in O(1) with practical space requirements. – actual Jan 05 '11 at 07:48
  • 2
    Another thought - have you considered using something like a binary decision diagram? I remember that Don Knuth once talked about using zero-suppressed binary decision diagrams for encoding functions that are mostly zero (in your case, you have a function from 32 bits to whether the number is present, and most of the time it isn't). This would give you your O(1) lookup time (since each lookup takes at most 32 steps), though I'm not sure how space-efficient it is. – templatetypedef Jan 05 '11 at 07:59
  • a bitset would be 512 MB. Is this really too much space? There won't be another practical O(1) data structure. – Markus Kull Jan 05 '11 at 08:09
  • @Markus Kull: 512MB will be too much since there is a need to work with multiple sets simultaneously. – actual Jan 05 '11 at 08:14

7 Answers

11

Regarding the second issue:

You could look into Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear-cut as a yes/no :p).

In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

  • single elements may be best stored in a hash-table
  • actual ranges can be stored in a sorted array

This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1,000 true ranges (i.e. an order of magnitude fewer), and log2(1000) ≈ 10

I would therefore suggest the following steps:

  • Bloom Filter: if no, stop
  • Sorted Array of real ranges: if yes, stop
  • Hash Table of single elements

The Sorted Array test is performed before the hash-table lookup because, from the numbers you give (millions of integers coalesced into a few thousand ranges), if a number is contained at all, chances are it'll be in a range rather than being single :)
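A minimal sketch of the gate-keeper stage (hypothetical names; two fixed multiplicative hashes and an arbitrary filter size -- real code would tune the size and hash count to the desired false-positive rate, and would insert every member of the set once at build time):

```c
#include <stdint.h>
#include <stdbool.h>

/* Filter size in bits; a power of two keeps the index computation cheap. */
#define BLOOM_BITS (1u << 20)

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } Bloom;

/* Two cheap multiplicative hashes (k = 2); illustrative, not tuned. */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) & (BLOOM_BITS - 1); }
static uint32_t h2(uint32_t x) { return (x * 2246822519u + 0x9E3779B9u) & (BLOOM_BITS - 1); }

/* Build step: call once for every member of the set (ranges expanded). */
static void bloom_add(Bloom *b, uint32_t x) {
    uint32_t i = h1(x), j = h2(x);
    b->bits[i >> 3] |= (uint8_t)(1u << (i & 7));
    b->bits[j >> 3] |= (uint8_t)(1u << (j & 7));
}

/* Query: false means "definitely not in the set"; true means "maybe",
   and the sorted-array / hash-table stages must confirm. */
static bool bloom_maybe(const Bloom *b, uint32_t x) {
    uint32_t i = h1(x), j = h2(x);
    return (b->bits[i >> 3] >> (i & 7) & 1) && (b->bits[j >> 3] >> (j & 7) & 1);
}
```

The "no" path costs two hashes and two bit tests, with no data-dependent branches on the set contents.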

One last note: beware of O(1). While it may seem appealing, you are not in an asymptotic case here. 5,000-10,000 ranges is few: log2(10000) is about 13. So don't pessimize your implementation by picking an O(1) solution with such a high constant factor that it actually runs slower than the O(log N) solution :)

Matthieu M.
  • Looks very practical and promising. From the Wikipedia article, a Bloom Filter requires about 4.8 bits per item, so we could have it with about 25% space overhead. Was my reading right? – 9dan Jan 05 '11 at 10:13
  • I think, storing ranges and numbers in different structure alone could be one important breakthrough. – 9dan Jan 05 '11 at 10:18
  • @9dan: it is a parameter. Depending on the percentage of false positive you wish to achieve you may tune it. The challenge however is generally not to come up with `m` and `k`, but to actually define the hash functions :) – Matthieu M. Jan 05 '11 at 10:22
  • @9dan: regarding the separation of ranges/numbers, I would certainly begin by this before implementing a Bloom Filter since it's a complicated beast. – Matthieu M. Jan 05 '11 at 10:23
7

If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.

The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that [a, b] < [c, d] iff b < c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals into a static array and sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the last slot. This construction takes O(n lg n) time.

To check whether some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in that range, continue the search on the left, and if it is greater than the largest value in that range, continue the search on the right. This is essentially a standard binary search, and it runs in O(lg n) time.
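The search described above can be sketched in C like this (assuming inclusive bounds and an array sorted by lower bound; names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint32_t lo, hi; } Range;   /* inclusive, disjoint, sorted by lo */

/* Binary search over disjoint sorted ranges: true iff some range contains x. */
bool contains(const Range *r, size_t n, uint32_t x) {
    size_t lo = 0, hi = n;                   /* search the index range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x < r[mid].lo)       hi = mid;       /* x lies left of this interval  */
        else if (x > r[mid].hi)  lo = mid + 1;   /* x lies right of this interval */
        else                     return true;    /* r[mid].lo <= x <= r[mid].hi   */
    }
    return false;
}
```

At 5,000-10,000 ranges this is about 13 probes, all into one contiguous, cache-friendly array.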

Hope this helps!

templatetypedef
  • 1
    Yep, that is my current implementation :) It is about 6-7 times slower than hash table on my test cases, but hash table is very space inefficient. Anyway, +1 for posting ;) – actual Jan 05 '11 at 08:03
  • 2
    You might be able to optimize this a bit by checking the largest ranges first, for a better chance of scoring an early hit. Perhaps make a separate list of all the ranges with over some number of elements and do a binary search on that. – Null Set Jan 05 '11 at 08:15
  • @actual: implementation detail -> using a binary tree to actually build the ranges is fine; however, once the ranges are stable, you can compress the information into a sorted array of pairs. The binary search has the same complexity, but it greatly increases memory locality. – Matthieu M. Jan 05 '11 at 09:07
  • @Matthieu M., yeah, I just named the class of similar algorithms and data structures as binary trees. In fact, I use binary search on arrays. – actual Jan 08 '11 at 10:19
2

AFAIK there is no algorithm that searches an arbitrary integer list in O(1).

One can only do an O(1) search by spending a vast amount of memory.

So it is not very promising to look for an O(1) search algorithm over a list of integer ranges.

On the other hand, you could try a time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).

9dan
  • Thanks for idea. I think I will try to make binary search on large ranges and some sort of hashing, maybe minimal, on single-item and small ranges. – actual Jan 05 '11 at 08:20
  • 2
    Well, [bucket sort](http://en.wikipedia.org/wiki/Counting_sort) can search a list in O(1) time. It might be better to say that no algorithm which uses *comparisons* can search over an integer list in O(1). – David Weiser Jan 05 '11 at 17:26
2

You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).

The easiest of these to implement is probably y-fast trees. They are asymptotically better than binary search, though they require roughly lg w = lg 32 = 5 hash-table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, so in practice binary search may still be faster.

jonderry
1

Rather than 'comparison'-based storage/retrieval (which will always be O(log n)), you need to work on 'radix'-based storage/retrieval.

In other words: extract nibbles from the uint32 and build a trie.
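A sketch of such a nibble trie in C (illustrative names; a sentinel node marks presence at the last level, and allocation error handling is omitted). Each lookup is a fixed 8 nibble steps, i.e. constant with respect to the set size, at the cost of up to 8 dependent memory accesses:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

/* An 8-level trie keyed on the nibbles of a uint32, most significant first. */
typedef struct Node { struct Node *child[16]; } Node;

static Node LEAF;  /* sentinel marking "key present" at the last level */

static Node *node_new(void) { return calloc(1, sizeof(Node)); }

void trie_insert(Node *root, uint32_t key) {
    Node *n = root;
    for (int shift = 28; shift > 0; shift -= 4) {   /* 7 internal levels */
        unsigned nib = (key >> shift) & 0xF;
        if (!n->child[nib]) n->child[nib] = node_new();
        n = n->child[nib];
    }
    n->child[key & 0xF] = &LEAF;                    /* final nibble: mark presence */
}

bool trie_contains(const Node *root, uint32_t key) {
    const Node *n = root;
    for (int shift = 28; shift > 0; shift -= 4) {
        n = n->child[(key >> shift) & 0xF];
        if (!n) return false;                       /* path absent: not in set */
    }
    return n->child[key & 0xF] == &LEAF;
}
```

Note that nothing here exploits the ranges directly; dense runs simply share most of their path, which is what keeps the trie compact when 50-70% of the values are consecutive.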

vrdhn
1

Keep your ranges into a sorted array and use binary search for lookups.

It's easy to implement, O(log N), and it uses less memory and fewer memory accesses than any tree-based approach, so it will probably also be much faster.

salva
1

From the description of your problem it sounds like the following might be a good compromise. I've described it in object-oriented terms, but it is easily convertible to C using a tagged union or a structure with a type member and a pointer.

Use the first 16 bits to index an array of objects (of size 65536). In that array there are five possible kinds of object:

  • a NONE object means no elements beginning with those 16 bits are in the set
  • an ALL object means all elements beginning with those 16 bits are in the set
  • a RANGE object means all items with the final 16 bits between an upper and lower bound are in the set
  • a SINGLE object means just one element beginning with those 16 bits is in the set
  • a BITSET object handles all remaining cases with a 65536-bit bitset

Of course, you don't need to split at 16 bits; you can adjust the split to reflect the statistics of your set. In fact, you don't even need to use consecutive bits, but consecutive bits speed up the bit twiddling, and if many of your elements are consecutive, as you claim, they will give good properties.

Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 trie with ranges and a bitset for a space/time tradeoff. If you need to save memory, make the tree deeper, with a corresponding slight increase in lookup time.
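The dispatch described above might look like this in C, using a tagged union (names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Tag selecting how the low 16 bits of each bucket are represented. */
typedef enum { NONE, ALL, RANGE, SINGLE, BITSET } Tag;

typedef struct {
    Tag tag;
    union {
        struct { uint16_t lo, hi; } range;   /* inclusive bounds on the low 16 bits */
        uint16_t single;
        uint64_t *bits;                      /* 65536-bit set = 1024 x uint64_t */
    } u;
} Bucket;

/* `buckets` is indexed by the high 16 bits of the query. */
bool set_contains(const Bucket buckets[65536], uint32_t x) {
    const Bucket *b = &buckets[x >> 16];
    uint16_t low = (uint16_t)x;
    switch (b->tag) {
    case NONE:   return false;
    case ALL:    return true;
    case RANGE:  return low >= b->u.range.lo && low <= b->u.range.hi;
    case SINGLE: return low == b->u.single;
    case BITSET: return (b->u.bits[low >> 6] >> (low & 63)) & 1;
    }
    return false;
}
```

Every lookup is one array index plus one switch: O(1) with a small constant, and the 64 KiB-per-bucket bitsets are only allocated for the few buckets that actually need them.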

Nick Fortescue