
I know of the following algorithm for heavy-hitters:

Algorithm findHeavyHitters(epsilon, inputStream)
    integer k = ceiling(1 / epsilon) - 1
    initialize hashmap H with capacity k

    while an item i arrives from the input stream:
        if H contains key i
            increment the value associated with i
        else if number of items in H < k
            insert i into H with value 1
        else if there exists an entry j in H with value 0
            remove j and insert i into H with value 1
        else
            decrement all values in H by 1
    endwhile

    return H

Correct me if I'm wrong, but this algorithm does not run in O(n). Is it possible to modify this algorithm so that it runs in O(n) while maintaining the O(1/epsilon) space usage?

For a stream of t items, the point of the algorithm is to return the items that occur at least epsilon*t times. Epsilon is given as a fraction (e.g. input 0.1 for data that occurs at least 10% of the time).

fluffychaos
  • What makes you think it doesn't run in O(n)? – rici Jun 16 '16 at 06:32
  • In the else block, we need to iterate over all the existing heavy-hitters to decrement each by 1. – fluffychaos Jun 16 '16 at 06:38
  • I thought that was your objection, so I covered it in my answer. The possible objection that hash table lookups are worst-case O(n) cannot so easily be fixed; afaik, you cannot get the worst case to be less than O(n log n). – rici Jun 16 '16 at 07:00
  • Related: http://stackoverflow.com/a/21705869/1566221 – rici Jun 16 '16 at 07:12
  • @fluffychaos I'm planning to implement heavy hitters using a count-min sketch. Can you please explain which cases the else block (decrement all values by 1) handles? Also, how does adding the new variable base help in the solution suggested in the answer? – user3508140 Aug 25 '19 at 19:21

1 Answer


The algorithm runs in average time O(n), on the basis that a hash lookup is, on average, O(1).

There are two implementation details to address. First, the last step, which seems to involve touching every value in H:

  • decrement all values in H by 1

To make this O(1), we add one additional storage location, called base, which is initialized to 0. Then we modify the algorithm as follows:

while an item i arrives from the input stream:
    if H contains key i
        increment the value associated with i
    else if number of items in H < k
        insert i into H with value base + 1
    else if there exists an entry j in H with value base
        remove j and insert i into H with value base + 1
    else
        increment base
endwhile
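
For concreteness, here is a minimal Python sketch of this variant (the function and variable names are mine). An item's estimated count is its stored value minus base, so incrementing base logically decrements every entry at once. Note that this sketch still finds a victim entry with a linear scan; the comb structure described next is what makes that step O(1) as well.

import math

def find_heavy_hitters(epsilon, stream):
    k = math.ceil(1 / epsilon) - 1
    counts = {}  # item -> stored value; a value equal to base means "empty"
    base = 0

    for i in stream:
        if i in counts:
            counts[i] += 1
        elif len(counts) < k:
            counts[i] = base + 1
        else:
            # Linear scan for a decayed entry; the comb removes this scan.
            j = next((x for x, v in counts.items() if v == base), None)
            if j is not None:
                del counts[j]
                counts[i] = base + 1
            else:
                base += 1  # "decrement all values" in O(1)

    # Report surviving candidates with their estimated counts
    # (stored values are offsets against base).
    return {x: v - base for x, v in counts.items() if v > base}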

The second issue is finding an entry with the value base (or 0) in O(1). This can be done by keeping the elements in a "comb": a linked list of doubly-linked lists. Each inner linked list contains the entries with a specific count. The outer linked list contains the count lists in order by count, with the head pointing to the list with the smallest count. If you draw this data structure, it looks like a comb:

[  base    ] -> entry a -> entry b -> entry c
    |
[ base + i ] -> entry d
    |
[ base + j ] -> entry e -> entry f
    |
   etc.

The hash table now points to the entries, rather than containing them. To increment the count of a single entry, the entry is removed from its list (if the list contains more than one element) and either inserted into the next list or put into a one-element list which is inserted after the list it was in, depending on the count associated with the next list. This operation is O(1).

The comb data structure is still O(k), where k is the number of elements in the hash, because there cannot be more distinct counts than elements.
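
As an illustration, here is one possible Python sketch of the comb (the class and method names are mine; insertion of a new entry and eviction of a base-count entry are omitted, but they splice buckets in the same way). Each bucket keeps its members in a hash set, so removing a specific entry from its bucket is also O(1) on average.

class Bucket:
    # All items that currently share one count; buckets form a doubly
    # linked list ordered by count, smallest first.
    __slots__ = ("count", "items", "prev", "next")

    def __init__(self, count):
        self.count = count
        self.items = set()            # O(1) average add/remove
        self.prev = self.next = None

class Comb:
    def __init__(self):
        self.head = None              # bucket holding the minimum count
        self.bucket_of = {}           # item -> its Bucket (the hash table)

    def _insert_after(self, prev, new):
        # Link `new` immediately after the existing bucket `prev`.
        new.prev, new.next = prev, prev.next
        if new.next:
            new.next.prev = new
        prev.next = new

    def _unlink(self, b):
        if b.prev:
            b.prev.next = b.next
        else:
            self.head = b.next
        if b.next:
            b.next.prev = b.prev

    def increment(self, item):
        # Move `item` from its bucket to a count+1 bucket in O(1).
        b = self.bucket_of[item]
        b.items.remove(item)
        nxt = b.next
        if nxt is None or nxt.count != b.count + 1:
            nxt = Bucket(b.count + 1)
            self._insert_after(b, nxt)
        nxt.items.add(item)
        self.bucket_of[item] = nxt
        if not b.items:               # emptied buckets are spliced out
            self._unlink(b)

    def min_count(self):
        # The head's count; comparing it with base finds a "zero" entry.
        return self.head.count if self.head else None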

Instead of doubly-linked lists, you can use a simple array and a list of the indices of the first entry with each count. To move an entry to the next count bucket, you start by swapping it with the last entry with that count, and then either advance the beginning of the next count list or insert a new count-list entry, depending on whether the next count list's count is one greater or more than one greater (see the sketch below). To accomplish the swap, it is necessary to update the locations of the two swapped entries in the hash, but that is still O(1).
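
A possible sketch of that variant, using dicts keyed by count for the block boundaries (arr, pos, count, start and end are all invented names): arr holds the items grouped by count in ascending order, pos maps each item to its index, and start/end give the first and last index of each count's block.

def increment(item, arr, pos, count, start, end):
    # Swap `item` with the last entry of its count block, then grow the
    # next block leftward (or create a new one-element block).
    c, i = count[item], pos[item]
    j = end[c]
    arr[i], arr[j] = arr[j], arr[i]
    pos[arr[i]], pos[arr[j]] = i, j
    if start[c] == j:                  # item was the block's only entry
        del start[c], end[c]
    else:
        end[c] = j - 1                 # count-c block shrinks on the right
    count[item] = c + 1
    if c + 1 in start:
        start[c + 1] = j               # next block's beginning advances
    else:
        start[c + 1] = end[c + 1] = j  # new one-element count block

Since the counts are ascending, the entries with the minimum count sit at the front of the array, so checking for an entry with value base is just a comparison against count[arr[0]].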

rici
  • Cool data structure - thanks! I'm still having trouble understanding how finding an entry is done in O(1) though. If you use a LinkedList of LinkedLists, in the worst case, don't you have to iterate over all values of the outer LinkedList to find which count the entry is associated with? This can potentially still be an O(k) traversal. Or, in the case where all k elements are mapped to the same count, then wouldn't you potentially need to traverse all k items of the inner list to find the node you want? – fluffychaos Jun 16 '16 at 14:29
  • @fluffychaos: You use the hash table, as in your original algorithm. The hash table contains pointers to the linked-list nodes. Splicing linked lists doesn't move the nodes around, so the hashtable pointers will still work. If you use the partitioned array data structure (at the end), you need to modify the hashtable links for the swapped elements, but there are always at most two swapped elements (sometimes no swap is necessary). (I mentioned all that in the answer.) – rici Jun 16 '16 at 19:30