
Basically, I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do this is with a hash table, using counter_name as the key and its corresponding count as the value for that key.

The counters don't need to be 100% accurate; approximate counts are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N distinct items by giving only an approximate result. Any ideas?
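For concreteness, that exact hash-table baseline might look roughly like the sketch below (the map type and names are just illustrative):

#include <string>
#include <unordered_map>

int main()
{
    // Exact baseline: one hash-table entry per counter name.
    std::unordered_map<std::string, long long> counts;
    ++counts["requests"];   // increment by name
    --counts["errors"];     // decrement by name
}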

  • Do you really have a space constraint, or is saving time more important? Is there some requirement to access counters by name? Is it a requirement to always access a counter by name, or only sometimes? – Spencer Apr 13 '16 at 00:04
  • You can keep an approximate counter by incrementing probabilistically. For instance, if you're willing to lose k bits of precision to save k bits of space, for each increment request, only actually increment the counter with probability 2^(-k). – jbapple Apr 13 '16 at 03:55
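A minimal sketch of jbapple's suggestion (the struct name LossyCounter, the choice k = 4, and the use of std::mt19937 are all just illustrative): each call only bumps the stored value with probability 2^-k, and the estimate scales it back up by 2^k.

#include <cstdint>
#include <random>

struct LossyCounter
{
    static constexpr unsigned k = 4;   // bits of precision traded for space
    uint32_t stored = 0;

    void increment(std::mt19937& rng)
    {
        if ((rng() & ((1u << k) - 1)) == 0)   // true with probability 2^-k
            ++stored;
    }

    uint64_t estimate() const { return static_cast<uint64_t>(stored) << k; }
};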

3 Answers


In my opinion, the thing you are looking for is a Count-min sketch.

It reads a stream of elements a1, a2, a3, ..., an, in which there can be many repeated elements, and at any time it can answer the following question: how many ai elements have you seen so far?

Basically, there is a one-to-one correspondence between your counters and the distinct elements of the stream: incrementing a counter is just feeding its name into the sketch. A Count-min sketch lets you adjust its parameters to trade memory for accuracy.

P.S. I described some other popular probabilistic data structures here.
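To make the idea concrete, here is a minimal Count-min sketch in C++ (the class and member names are invented for this sketch, and the per-row hashing is deliberately simplistic): each of depth rows hashes the key to one cell, an increment adds to one cell per row, and the estimate is the minimum over the rows, which can over-count because of collisions but never under-counts.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class CountMinSketch
{
  public:
    CountMinSketch(std::size_t width, std::size_t depth)
        : width_(width), table_(depth, std::vector<uint64_t>(width, 0)) { }

    void increment(const std::string& key, uint64_t by = 1)
    {
        for (std::size_t row = 0; row < table_.size(); ++row)
            table_[row][index(key, row)] += by;
    }

    uint64_t estimate(const std::string& key) const
    {
        uint64_t best = UINT64_MAX;
        for (std::size_t row = 0; row < table_.size(); ++row)
            best = std::min<uint64_t>(best, table_[row][index(key, row)]);
        return best;
    }

  private:
    // Simplistic per-row hashing: mix the row number into the key.
    std::size_t index(const std::string& key, std::size_t row) const
    {
        return std::hash<std::string>{}(key + '#' + std::to_string(row)) % width_;
    }

    std::size_t width_;
    std::vector<std::vector<uint64_t>> table_;
};

int main()
{
    CountMinSketch sketch(1 << 12, 4);   // fixed size, independent of how many names exist
    sketch.increment("apples", 3);
    sketch.increment("oranges");
    std::cout << sketch.estimate("apples") << '\n';   // at least 3, usually exactly 3
}

Decrements can be handled by subtracting in every row as well, though the standard accuracy guarantees assume non-negative updates.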

Salvador Dali

Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
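A rough sketch of that second variant (the names NameCounterTable, pool_ and buckets_ are invented here, the table is fixed-size with no growth or deletion, and it assumes the table never fills up completely): each bucket holds a 4-byte offset into a shared NUL-terminated name block plus the counter itself, with linear probing for collisions.

#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

class NameCounterTable
{
    static constexpr uint32_t kEmpty = UINT32_MAX;
    struct Bucket { uint32_t offset = kEmpty; uint32_t count = 0; };

  public:
    explicit NameCounterTable(std::size_t capacity) : buckets_(capacity) { }

    uint32_t& counter(const std::string& name)
    {
        std::size_t i = std::hash<std::string>{}(name) % buckets_.size();
        // Linear probing: stop at an empty bucket or a matching name.
        while (buckets_[i].offset != kEmpty &&
               std::strcmp(pool_.c_str() + buckets_[i].offset, name.c_str()) != 0)
            i = (i + 1) % buckets_.size();
        if (buckets_[i].offset == kEmpty)   // first time this name is seen
        {
            buckets_[i].offset = static_cast<uint32_t>(pool_.size());
            pool_ += name;
            pool_ += '\0';
        }
        return buckets_[i].count;
    }

  private:
    std::string pool_;              // all names, NUL terminated, back to back
    std::vector<Bucket> buckets_;   // each bucket: 4-byte offset + 4-byte count
};

Usage is e.g. ++table.counter("some_name"), and the count field could be shrunk further with the approximate counter below.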

For the counter itself, you can save space by using a probabilistic approach as follows:

#include <cstdlib>   // for rand()

template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }

    Approx_Counter& operator++()
    {
        // Count exactly up to 2, then only bump n_ with a 1-in-2^n_ chance.
        if (n_ < 2 || rand() % (operator Q()) == 0)
            ++n_;
        return *this;
    }

    // Reported (approximate) count: exact for 0 and 1, otherwise 2^n_.
    // Q(1) << n_ keeps the shift in the wider type Q to avoid int overflow.
    operator Q() const { return n_ < 2 ? n_ : Q(1) << n_; }

  private:
    T n_;   // stores only the exponent, so a small type suffices
};

Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
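For example, combined with an ordinary map keyed by name (assuming the Approx_Counter above; only the one-byte exponent is stored per counter):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    // One 1-byte approximate counter per name, reported as unsigned long.
    std::unordered_map<std::string, Approx_Counter<unsigned char, unsigned long>> counters;

    for (int i = 0; i < 1000; ++i)
        ++counters["apples"];

    // Prints a power of two near 1000 (e.g. 512, 1024 or 2048).
    std::cout << static_cast<unsigned long>(counters["apples"]) << '\n';
}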

The idea's simple:

  • when n_ is 0, ++ has definitely not been invoked
  • when n_ is 1, ++ has definitely been invoked exactly once
  • when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times

To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.

Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.

That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).

The above is still O(N) in counter space, just with a smaller constant. For genuinely sub-O(N) options, you need to consider whether and how keys are related, such that incrementing one counter might reasonably impact multiple keys. You've given us no insights into that in your question to date.

Tony Delroy

The names probably take up more space than the counters.

How about having a fixed number of counters and only keeping the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...
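One concrete way to do that is something along the lines of the Space-Saving algorithm rather than a strict LRU (the names TopKCounters and capacity_ are made up for this sketch): keep at most capacity named counters, and when a new name arrives while the table is full, evict the smallest counter and let the newcomer inherit its count plus one, so heavy hitters tend to survive while rare names churn at the bottom.

#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>

class TopKCounters
{
  public:
    explicit TopKCounters(std::size_t capacity) : capacity_(capacity) { }

    void increment(const std::string& name)
    {
        auto it = counts_.find(name);
        if (it != counts_.end()) { ++it->second; return; }
        if (counts_.size() < capacity_) { counts_[name] = 1; return; }

        // Table full: evict the current minimum (O(capacity) here; a real
        // implementation would keep a secondary structure to find it faster).
        auto min_it = std::min_element(counts_.begin(), counts_.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        uint64_t inherited = min_it->second;
        counts_.erase(min_it);
        counts_[name] = inherited + 1;
    }

    uint64_t estimate(const std::string& name) const
    {
        auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }

  private:
    std::size_t capacity_;
    std::unordered_map<std::string, uint64_t> counts_;
};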

Stefan Haustein