
Working on wrapping my head around the CountSketch data structure and its associated algorithms. It seems to be a great tool for finding common elements in streaming data, and its additive nature makes for some fun properties when looking for large changes in frequency, perhaps similar to what Twitter uses for trending topics.

The paper is a little difficult to follow for someone who has been away from more academic approaches for a while. A previous post here did help some, but for me at least it still left quite a few questions.

As I understand it, the CountSketch structure is similar to a Bloom filter. However, the selection of hash functions has me confused. The structure is an N-by-M table, with N hash functions (each mapping to one of M possible values) determining the "bucket" to alter, plus another "pairwise independent" hash function s for each of the N rows.

Are the hashes to be selected from a universal hashing family, say something of the form `h(x) = ((a*x + b) % some_prime) % M`?
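For concreteness, here is what I picture (Python; `make_universal_hash` and the constants are my own illustration, not from the paper):

```python
import random

P = 2**61 - 1  # a Mersenne prime, larger than the key universe

def make_universal_hash(m, p=P):
    """Draw a random h(x) = ((a*x + b) % p) % m from the universal family."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m
```

So each of the N rows would get its own independently drawn `h`, mapping a key to one of the M buckets.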

And if so, where are the s hashes that return either +1 or -1 chosen from? And what is the reason for ever subtracting from one of the buckets?

– Peck

1 Answer


They subtract from the buckets so that the average effect of the additions/subtractions caused by other keys' occurrences is 0. If half the time I add the count of 'foo', and half the time I subtract the count of 'foo', then in expectation the count of 'foo' does not influence the estimate of the count for 'bar'.
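A toy simulation (mine, not from the paper) makes the cancellation concrete. Suppose 'foo' collides into 'bar's bucket: the ESTIMATE step multiplies the bucket by 'bar's own sign, so 'bar's contribution always comes back positive, while 'foo's contribution flips sign from sketch to sketch and averages away:

```python
import random

count_bar, count_foo = 50, 10  # true counts; 'foo' shares 'bar's bucket
trials = 100_000
total = 0.0
for _ in range(trials):
    s_bar = random.choice((-1, 1))   # sign hash drawn for 'bar'
    s_foo = random.choice((-1, 1))   # sign hash drawn for 'foo'
    bucket = s_bar * count_bar + s_foo * count_foo
    total += s_bar * bucket          # ESTIMATE re-multiplies by 'bar's sign
print(total / trials)                # ~50: 'foo' cancels out in expectation
```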

Picking a universal hash function like you describe will indeed work, but it's mostly important for the theory rather than the practice. Salting your favorite reasonable hash function will work too; you just can't meaningfully write proofs about the expected values when using a few fixed hash functions.
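Putting the pieces together, a minimal sketch might look like this (my own illustration, using a salted MD5 in place of the formal pairwise-independent families; the class and method names are made up):

```python
import hashlib
import statistics

class CountSketch:
    """N rows of M signed counters, per the structure in the question."""

    def __init__(self, n_rows, n_buckets):
        self.n_rows = n_rows
        self.n_buckets = n_buckets
        self.table = [[0] * n_buckets for _ in range(n_rows)]

    def _bucket_and_sign(self, key, row):
        # Salt a standard hash with the row index; fine in practice, though
        # the formal guarantees assume pairwise-independent families.
        digest = hashlib.md5(f"{row}:{key}".encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        return h % self.n_buckets, (1 if (h >> 63) & 1 else -1)

    def add(self, key, count=1):
        for row in range(self.n_rows):
            bucket, sign = self._bucket_and_sign(key, row)
            self.table[row][bucket] += sign * count

    def estimate(self, key):
        # Re-multiplying by the sign flips the key's own contribution back
        # to positive; the median across rows suppresses collision noise.
        values = []
        for row in range(self.n_rows):
            bucket, sign = self._bucket_and_sign(key, row)
            values.append(sign * self.table[row][bucket])
        return statistics.median(values)
```

With, say, `CountSketch(5, 1000)`, adding 'foo' 100 times makes `estimate('foo')` come back close to 100, give or take collision noise from other keys.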

– Rob Neuhaus
  • If, when foo is added, half the buckets get -1 and half get +1, won't the median returned by the ESTIMATE function for foo tend towards 0? – Peck Jan 05 '12 at 12:28
  • It is true that, without conditioning on any information, the average counts are 0. But now fix a single count, whose mean is indeed 0, and consider the 'foo' event: you know the direction of movement for 'foo'. Conditioned on knowing the direction for 'foo', you have a bias towards 'foo's direction, and to the degree that the estimate varies from 0 in that direction, you credit the count of 'foo'. – Rob Neuhaus Jan 05 '12 at 16:12
  • The inside of the median has a bias towards the actual thing being estimated, and the median just gets rid of lots of the variance (the simulation after this thread illustrates this). – Rob Neuhaus Jan 05 '12 at 16:24
  • Thanks for the replies; you lost me a little on the conditioning part and "direction of". Is it essentially using half of the s's for any given key ('foo', 'bar', 'baa') to serve as a counterweight to all the other terms, and half to add to its own degree? It seems this throws the estimate off by quite a bit, but perhaps I'm reading ESTIMATE too literally and it's more relative to the other keys stored in the sketch? – Peck Jan 05 '12 at 18:30
  • And s is deterministic, correct? If it needs to be roughly split between {-1, +1}, then wouldn't a randomly generated value (over a long enough time) assure this more than a fixed hash? Perhaps I need to refresh on my hashing education :) – Peck Jan 05 '12 at 18:36
  • by "direction of", I just mean the s[q], the sign of the hash value for q. negative counts are a feature, not a bug. – Rob Neuhaus Jan 05 '12 at 19:01
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/6442/discussion-between-rrenaud-and-peck) – Rob Neuhaus Jan 05 '12 at 20:06
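To make the median discussion above concrete, here's a toy simulation (mine, not from the paper): each row's signed reading is an unbiased estimate of 'foo's true count, and the median across rows damps the noise from colliding keys.

```python
import random
import statistics

true_foo = 100
other_counts = [random.randrange(1, 50) for _ in range(50)]  # other keys
rows = 25

row_estimates = []
for _ in range(rows):
    s_foo = random.choice((-1, 1))
    bucket = s_foo * true_foo
    for c in other_counts:
        if random.random() < 0.1:  # this key collides into foo's bucket
            bucket += random.choice((-1, 1)) * c
    # Each row's signed reading is unbiased for 100, but noisy.
    row_estimates.append(s_foo * bucket)

print(statistics.median(row_estimates))  # close to 100
```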