The count-min sketch is a probabilistic data structure for lossy storage of counts in a multiset. It receives updates (i, c), where i is an element of a set and c is a non-negative quantity for that element, then does clever things with hash functions. It is widely discussed on SO and elsewhere; here is the original paper (PDF) and the Wikipedia article. Based on the application I am considering it for -- lossy storage of count data from single-cell genomics experiments -- let's assume i and c are both integers. The pair (i, c) means that in a given biological cell, gene i was detected c times.
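To make the "clever things with hash functions" concrete, here is a minimal sketch of the data structure as I understand it: d rows of w counters, each row with its own hash function; an update increments one counter per row, and a query takes the minimum over the rows. (The class name, parameters, and the use of salted blake2b as the per-row hash are my own illustrative choices, not from the paper.)

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters, one hash per row."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive an independent-ish hash per row by salting with the row number.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is an estimate that never underestimates the true count.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

Note that the table size is width * depth regardless of how many distinct items are inserted; the accuracy guarantee, not the raw storage, is what depends on n.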
My question is about how much memory the count-min sketch takes compared to the sparse matrix formats more commonly used for this type of data. For a simple example of an alternative, consider a hash table -- say, a Python dictionary -- storing each distinct gene i as a key, with the sum of the corresponding values of c as its value. If n distinct genes are observed in a given cell, then this takes O(n) space. This answer explains that, to store counts of n distinct genes, the count-min sketch also takes O(n) space. (Identifiers for the genes are stored separately as an array of strings.)
I don't understand why anyone would introduce so much complexity for what seems to be no improvement in compression. I also don't understand what's special about this application that would render the count-min sketch useless when it's useful for lots of other purposes. So:
- For this application, does the count-min sketch save space over typical sparse matrix storage schemes?
- Is there any application for which the count-min sketch saves space over typical sparse matrix storage schemes? If so, what is the key difference from this application?