I do not see how to do it with HyperLogLog, but I do see how to do it with less efficient cardinality estimators.
Here is a simple cardinality estimator, which you can find in http://www.cse.unsw.edu.au/~cs9314/07s1/lectures/Lin_CS9314_References/fm85.pdf. Calculate the hash value of each element. Keep the smallest m hash values. Use the size of the m'th smallest hash value to estimate the cardinality of the whole set. (Let's ignore hash collisions.)
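In Python, a minimal sketch of that estimator might look like the following. The hash helper, the choice of m, and the exact form of the estimate (m/h rather than, say, (m-1)/h) are just illustrative choices here, not anything prescribed by the paper.

```python
import hashlib

M = 256  # how many of the smallest hashes to keep; an arbitrary illustrative choice

def hash01(x):
    """Hash an element to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(repr(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class BottomM:
    """Keep the m smallest distinct hashes; estimate n from the m-th smallest."""

    def __init__(self, m=M):
        self.m = m
        self.smallest = set()  # the m smallest distinct hash values seen so far

    def add(self, x):
        h = hash01(x)
        if h in self.smallest:
            return                                   # already counted
        if len(self.smallest) < self.m:
            self.smallest.add(h)
        elif h < max(self.smallest):                 # O(m) scan, for clarity not speed
            self.smallest.remove(max(self.smallest))
            self.smallest.add(h)

    def estimate(self):
        if len(self.smallest) < self.m:
            return len(self.smallest)                # exact below m distinct elements
        # The m-th smallest of n uniform [0,1) values sits near m/(n+1),
        # so n is roughly m divided by that value.
        return self.m / max(self.smallest)
```

With m in the hundreds, the relative error of this kind of estimator is typically on the order of 1/sqrt(m).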
Now here is an adaptation to handle some deletions. Keep the smallest 2m hash values. Use the size of the m'th smallest to estimate the cardinality of the whole set. If an element is deleted, just remove its hash from the set. As long as your set doesn't shrink by around a factor of 2, this should work pretty well.
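A sketch of that variant, reusing hash01 and M from the sketch above (again, only one way to set it up):

```python
import bisect

class BottomMWithDeletes:
    """Keep the 2m smallest hashes; estimate from the m-th smallest."""

    def __init__(self, m=M):
        self.m = m
        self.values = []                 # sorted list of at most 2m hashes

    def add(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.values, h)
        if i < len(self.values) and self.values[i] == h:
            return                       # duplicate element
        if len(self.values) < 2 * self.m:
            self.values.insert(i, h)
        elif h < self.values[-1]:
            self.values.insert(i, h)
            self.values.pop()            # evict the largest to stay at 2m

    def delete(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.values, h)
        if i < len(self.values) and self.values[i] == h:
            del self.values[i]           # only matters if the hash was in the window

    def estimate(self):
        if len(self.values) < self.m:
            return len(self.values)
        return self.m / self.values[self.m - 1]   # m-th smallest hash
```

The extra m values are exactly the factor-of-2 cushion: deletions eat into the window from within, and the estimate stays sound as long as at least m values survive.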
And if you need to handle more? Add the idea of "ghost" elements. When you delete a hash value from the set, add a "ghost" hash value where the 2m+1'th smallest hash value would be expected to be. When you delete a real hash value, each ghost has a random chance of being deleted along with it, matching the fraction of real elements that got deleted. If a ghost is deleted, you insert more. If you insert enough real values that a ghost gets too big to be among the smallest 2m, you let it fall off like any other value.
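The description leaves a couple of details open, namely exactly where a ghost is placed and how the deletion fraction is tracked, so the following sketch is only one possible reading, reusing hash01 and M from above. The concrete choices are assumptions: a ghost is extrapolated just past the current maximum of the window, and on each delete every ghost dies with probability roughly 1/(current estimate).

```python
import bisect
import random

class BottomMWithGhosts:
    """Bottom-2m estimator where deletions are backfilled with ghost hashes."""

    def __init__(self, m=M):
        self.m = m
        self.reals = []                  # sorted hashes of stored live elements
        self.ghosts = []                 # sorted ghost hashes

    def _trim(self):
        # Keep only the 2m smallest values across reals and ghosts combined.
        while len(self.reals) + len(self.ghosts) > 2 * self.m:
            if self.ghosts and (not self.reals or self.ghosts[-1] > self.reals[-1]):
                self.ghosts.pop()        # a ghost fell off the window
            else:
                self.reals.pop()

    def _spawn_ghost(self):
        # Extrapolate where the next-larger order statistic would be expected:
        # k values spread evenly below `top` put the (k+1)-th near top*(k+1)/k.
        k = len(self.reals) + len(self.ghosts)
        if k == 0:
            return
        top = max(self.reals[-1] if self.reals else 0.0,
                  self.ghosts[-1] if self.ghosts else 0.0)
        g = top * (k + 1) / k
        if g < 1.0:
            bisect.insort(self.ghosts, g)

    def add(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.reals, h)
        if i < len(self.reals) and self.reals[i] == h:
            return
        self.reals.insert(i, h)
        self._trim()

    def delete(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.reals, h)
        if i < len(self.reals) and self.reals[i] == h:
            del self.reals[i]
            self._spawn_ghost()          # backfill the window for the lost value
        # One delete removes roughly a 1/n fraction of live elements; use the
        # current estimate as a stand-in for n, kill each ghost with that
        # probability, and replace every dead ghost with a fresh one further out.
        p = 1.0 / max(self.estimate(), 1.0)
        deaths = sum(random.random() < p for _ in self.ghosts)
        for _ in range(deaths):
            self.ghosts.pop(random.randrange(len(self.ghosts)))
            self._spawn_ghost()
        self._trim()

    def estimate(self):
        merged = sorted(self.reals + self.ghosts)
        if len(merged) < self.m:
            return len(merged)
        return self.m / merged[self.m - 1]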
The resulting algorithm needs more memory, but it handles both adds and deletes. It should be reasonably accurate even if a large fraction of your data gets deleted.