4

The count-min sketch is a probabilistic data structure for lossy storage of counts in a multiset. It receives updates (i, c) where i is an element of a set and c is a non-negative quantity for that element, then does clever things with hash functions. It is widely discussed on SO and elsewhere; here is the original paper (PDF) and the Wikipedia article. Based on the application I am considering it for -- lossy storage of count data from single-cell genomics experiments -- let's assume i and c are both integers. The pair i,c means that in a given biological cell, gene i was detected c times.

My question is about how much memory the count-min sketch takes compared to sparse matrix formats more commonly used for this type of data. For a simple example of an alternative, consider a hash table -- say, a Python dictionary -- storing each distinct value of i with the sum of the corresponding values of c. If n distinct genes are observed in a given cell, then this takes O(n) space. This answer explains that, to store counts of n distinct genes, the count-min sketch also takes O(n) space. (Identifiers for the genes are stored separately as an array of strings.)
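To make the hash-table baseline concrete, here is a minimal sketch of the exact approach, assuming the updates arrive as (i, c) pairs of integers. The function name `accumulate_counts` and the sample updates are made up for illustration.

```python
from collections import defaultdict


def accumulate_counts(updates):
    """Sum the counts c for each gene index i, exactly.

    Memory grows with the number of distinct genes seen: O(n).
    """
    totals = defaultdict(int)
    for gene, count in updates:
        totals[gene] += count
    return dict(totals)


# Hypothetical updates for one cell: gene 3 is seen twice (2 + 5).
updates = [(3, 2), (7, 1), (3, 5), (42, 1)]
print(accumulate_counts(updates))  # {3: 7, 7: 1, 42: 1}
```

Every distinct gene gets its own entry, so nothing is lost, but the dictionary cannot be made smaller than the number of distinct genes.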

I don't understand why anyone would introduce so much complexity for what seems to be no improvement in compression. I also don't understand what's special about this application that would render the count-min sketch useless when it's useful for lots of other purposes. So:

  • For this application, does the count-min sketch save space over typical sparse matrix storage schemes?
  • Is there any application for which the count-min sketch saves space over typical sparse matrix storage schemes? If so, what is the key difference from this application?
templatetypedef
eric_kernfeld
  • I'm not sure what you're asking. This isn't a data structure that stores data. If you need to store data it's not for you. If you need just the specific things that count-min does well, just calculate them directly? If you're just looking to cut down storage space, try a binary format with appropriate numeric types instead of an ascii coordinate format. – CJR Oct 16 '20 at 07:24
  • True, matrix market is a file format and not a data structure. I've edited to clarify. Does that help? – eric_kernfeld Oct 16 '20 at 13:04
  • No - that one stores data. It's not super efficient (because it's storing numbers in text format) but it does store data. Count-min doesn't store data. If you want to store data it is not for you. – CJR Oct 16 '20 at 16:23

2 Answers

3

Count-min sketches are primarily, but not always, used in applications where you’re trying to find the most frequent items in a data stream. The idea is that a count-min sketch can only overestimate an item’s frequency: hash collisions add other items’ counts to it, and with non-negative updates nothing is ever subtracted. So if an item has a high frequency, it will always appear to have a high frequency when you get the estimate from the count-min sketch, but if an item has a low frequency, it’ll usually have a larger but still low-ish frequency estimate.
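For readers who haven’t seen one, here is a minimal count-min sketch in Python that shows the overestimation behaviour described above. It is only an illustrative sketch: the class name `CountMinSketch` is made up here, and salting BLAKE2b with the row index is an assumption standing in for the pairwise-independent hash family used in the original paper.

```python
import hashlib


class CountMinSketch:
    """A depth x width table of counters; memory is fixed up front."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # One hash function per row, obtained by salting with the row index
        # (an illustrative choice, not the hash family from the paper).
        digest = hashlib.blake2b(
            str(item).encode(),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Each row's counter holds the true count plus any colliding items,
        # so the minimum over rows is the least-inflated estimate.
        return min(
            self.table[row][self._column(row, item)]
            for row in range(self.depth)
        )


# A deliberately tiny table so that collisions are likely.
sketch = CountMinSketch(width=4, depth=3)
for gene, count in [(3, 2), (7, 1), (3, 5), (42, 1), (9, 4), (11, 6)]:
    sketch.update(gene, count)

for gene in (3, 7, 42, 9, 11):
    print(gene, sketch.estimate(gene))  # each estimate is >= the true count
```

Note that the sketch never stores the item identifiers, only counters, which is why its size does not have to grow with the number of distinct items.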

This makes count-min sketches excellent choices for situations like finding the most popular searches on Google or the most-viewed items on Amazon. You can configure a count-min sketch to use very little space compared with a traditional hash table - exactly how much space you need is up to you, since you can tune the accuracy and confidence parameters based on your available memory - and still be confident in the estimates you get back.
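As a rough illustration of that tuning, the standard analysis of the count-min sketch uses width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, which guarantees that an estimate exceeds the true count by more than ε times the total of all counts with probability at most δ. The helper name `sketch_dimensions` and the parameter values below are assumptions chosen for the example.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (width, depth) for additive error epsilon * N (N = total of
    all counts) with failure probability at most delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


width, depth = sketch_dimensions(epsilon=0.001, delta=0.01)
print(width, depth, width * depth)  # 2719 5 13595
```

The number of counters depends only on ε and δ, not on how many distinct items appear in the stream. That is where the savings come from when there are many distinct items, and it is also why a sketch buys little when the number of distinct items is already small.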

On the other hand, if you’re working on an application in which it’s important to store the true counts of each item you store, or where low-frequency items need to be identified as such, then a count-min sketch isn’t really going to help all that much. For that, there really isn’t much you can do to improve over, say, a hash table.

Keep in mind that, in general, there’s no way to compress arbitrary frequency data losslessly. The reason a count-min sketch can work so well for finding frequent items is that it can afford to lose exact counts for all the low-frequency elements. The same trick doesn’t work for tracking low-frequency elements because, typically, there are far more low-frequency elements than high-frequency elements, so throwing away the high-frequency elements won’t reduce the data size all that much.

So the answer to your question is “it depends on what you’re doing.” If your application needs precise counts and it’s really bad to overestimate frequencies, just use a regular hash table. If you’re just looking for the most common genes, then a count-min sketch might be a great choice.

Dharman
templatetypedef
0

As an alternate answer to my own question: I think I misunderstood the answer I linked to. Contrary to my question's premise, it never states that the count-min sketch takes O(n) space. The space requirements depend on the desired accuracy.

eric_kernfeld