
I have a large array with non-decreasing values, like this:

array = [0, 1, 6, 6, 12, 13, 22, ..., 92939, 92940]

and I want to use the interpolation search algorithm on it. The size of the array is variable; new elements are appended to the end of the array.

I need to find the index of some element; let's call it X.

Y = find(X in array)

Y must be an index of an element of the array such that array[Y] >= X. find could be implemented with binary search, but for some complicated reasons I want to implement it using interpolation search. Interpolation search tries to guess the correct position of X by looking at the bounds of the array. If the first array value is 0, the last is 100, and the array length is 1000, then to find the position of the value 25 I should look at index 250 first. This works like a charm if the values of the array are evenly distributed, but if they are not, interpolation search can be slower than binary search (some optimizations are possible).
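
To make the goal concrete, here is a minimal sketch of that probe logic in Python. The function name search_geq is mine, and it assumes (as the question implies) that the array is sorted in ascending order and that its last element is >= X:

def search_geq(array, x):
    # Return the smallest index y such that array[y] >= x.
    # Invariant: array[hi] >= x and every element before lo is < x.
    lo, hi = 0, len(array) - 1
    while lo < hi and array[lo] < x:
        # Guess the position by linear interpolation between the bounds.
        mid = lo + (x - array[lo]) * (hi - lo) // (array[hi] - array[lo])
        mid = min(max(mid, lo), hi - 1)  # clamp so the range always shrinks
        if array[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

For the example above (values 0 to 100 spread over 1000 elements, searching for 25), the first probe lands near index 250.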

I'm trying to speed up the search in such cases using a Count-Min Sketch data structure. When I append a new element to the array, I also add some data to the count-min sketch:

Z = 1005000
elements_before_Z = len(array)
array.append(Z)
count_min_sketch.add(Z, elements_before_Z)
# Z is the key and elements_before_Z is the count

Using this approach I can guess the position of the searched element X approximately. This can result in a search speedup if the guess is correct, but I've run into some problems.
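
For reference, here is a minimal Count-Min Sketch in Python showing the add/estimate mechanics used above. The class shape, width, and depth are illustrative assumptions; a production implementation would use proper pairwise-independent hash functions instead of seeding Python's built-in hash:

import random

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]  # one hash seed per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, seed, key):
        return hash((seed, key)) % self.width

    def add(self, key, count):
        for row, seed in zip(self.table, self.seeds):
            row[self._index(seed, key)] += count

    def estimate(self, key):
        # The minimum over the rows never underestimates the true total,
        # but collisions can push it above it.
        return min(row[self._index(seed, key)]
                   for row, seed in zip(self.table, self.seeds))

A guessed starting position for X is then count_min_sketch.estimate(X), and the two problems below are exactly the cases where that guess is garbage (X was never added) or inflated (collisions).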

  1. I don't know whether X is in the array, i.e. whether my count_min_sketch has seen this value. If it has, I can get the correct value out of the count_min_sketch data structure. If it hasn't, I will get 0 or some other, arbitrary value (worst case scenario).

  2. Collisions. If the value X has been seen by my count_min_sketch object, then I get back the correct value or a larger one. If a count-min sketch is used for something like counting word occurrences in a document, this is not a problem, because collisions are rare and the error is at most the number of collisions (it is usually used like this: count_min_sketch.add(Z, 1)). In my case, every collision can result in a large error, because I add a large count for every key.

Is it possible to use a count-min sketch in such a way, adding a large count on every update?

  • Integer predecessor is a well-studied problem in Computer Science. You might want to have a look at [vEB trees](http://en.wikipedia.org/wiki/Van_Emde_Boas_tree) and [y-fast tries](http://en.wikipedia.org/wiki/Y-fast_trie) in any real-world scenario. But honestly, I don't think they will beat a standard binary search. And as you mentioned, hashing-based set data structures such as CM Sketch or Bloom filters won't help you, because they achieve their speed by randomizing the order of elements, which is exactly what you can't have. – Niklas B. Jun 03 '14 at 15:12
  • FWIW, if you know that adjacent elements will differ on average by at most some small constant k, you can use a hash table/sketch data structure and just try to find exactly the values X, X - 1, X - 2, ... until you hit a number that is in the set (sketched after these comments). – Niklas B. Jun 03 '14 at 15:17
  • I think that I can just right-shift the key (X >> k) and then add the result to the CM sketch, but your approach is more precise. – Evgeny Lazin Jun 03 '14 at 15:29
  • Well yeah that depends on whether you have more updates or more queries. – Niklas B. Jun 03 '14 at 15:30
  • I can beat standard binary search because of caching effects. All my data is in a memory-mapped file and it's huge, so every step of the binary search will likely result in a page fault. – Evgeny Lazin Jun 03 '14 at 15:30
  • Yep, I already beat it on some datasets (the most common in my case), but if the dataset contains gaps or is skewed, interpolation search performs badly :( – Evgeny Lazin Jun 03 '14 at 15:32
  • Well, one simple idea would be to sample every x-th element of your data so that you can binary search in the sampled data and then know the approximate position in the real array (sketched after these comments). – Niklas B. Jun 03 '14 at 15:39
  • Obviously you can do that in multiple (let's say N) layers, so that you have at most N cache misses. – Niklas B. Jun 03 '14 at 15:44
  • Count-min sketch seems rather underpowered for this application. On huge data with lots of distinct elements and a not-huge sketch, you'll often get a garbage number for things not in the set and values that are rather too high for many things that are in the set. Why not do either random sampling or the regular sampling suggested by Niklas B.? – tmyklebu Jun 03 '14 at 17:23
  • If you do one iteration of Binary Search every time an Interpolation Search probe does not reduce the area of uncertainty by at least a factor of two, then the worst case is no more than twice the cost of Binary Search and the best case is Interpolation Search plus some simple checks (sketched after these comments). – mcdowella Jun 03 '14 at 17:45
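
To illustrate Niklas B.'s probing idea from the comments above: assuming a hypothetical dict value_to_index that maps each stored value to its index in the array, and a known bound max_gap on the difference between adjacent values, the lookup is a short downward scan (both names are mine):

def probe_down(value_to_index, x, max_gap):
    # Try x, x-1, x-2, ... until we hit a value that was actually stored.
    for candidate in range(x, x - max_gap - 1, -1):
        idx = value_to_index.get(candidate)
        if idx is not None:
            return idx
    return None  # nothing stored within max_gap below x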
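
The sampling idea can be sketched like this: keep every step-th element in a small in-memory list, binary search that list first, and only then touch a narrow slice of the big (possibly memory-mapped) array; stacking N such layers caps the number of cold probes at roughly N:

import bisect

def build_samples(array, step):
    return array[::step]  # every step-th element, still sorted

def sampled_search(array, samples, step, x):
    # Cheap search in the samples, which fit in cache.
    block = bisect.bisect_left(samples, x)
    # The answer lies between the two surrounding samples.
    lo = max(0, (block - 1) * step)
    hi = min(len(array), block * step + 1)
    return bisect.bisect_left(array, x, lo, hi)

Usage: build samples once with, say, samples = build_samples(array, 1024), then call sampled_search(array, samples, 1024, X); the step of 1024 is an arbitrary illustrative choice.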
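
And mcdowella's hybrid can be sketched by bolting one binary-search step onto the interpolation loop shown earlier (same assumptions: ascending array whose last element is >= x):

def hybrid_search(array, x):
    lo, hi = 0, len(array) - 1
    while lo < hi and array[lo] < x:
        span = hi - lo
        # Interpolation probe, clamped as before.
        mid = lo + (x - array[lo]) * span // (array[hi] - array[lo])
        mid = min(max(mid, lo), hi - 1)
        if array[mid] < x:
            lo = mid + 1
        else:
            hi = mid
        if hi - lo > span // 2:
            # The probe did not halve the range: take one plain binary step.
            half = (lo + hi) // 2
            if array[half] < x:
                lo = half + 1
            else:
                hi = half
    return lo

This bounds the worst case at roughly twice the cost of binary search while keeping interpolation search's best case.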

0 Answers