I do not see how to do it with HyperLogLog, but I do see how to do it with less efficient cardinality estimators.
Here is a simple cardinality estimator, which you can find in http://www.cse.unsw.edu.au/~cs9314/07s1/lectures/Lin_CS9314_References/fm85.pdf. Calculate the hash value of each element. Keep the smallest m hash values. Use the size of the m'th smallest hash value to estimate the cardinality of the whole set. (Let's ignore hash collisions.)
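In Python, a minimal sketch of that estimator might look like the following. The hash helper, the choice of m, and the exact form of the estimate (m/h rather than, say, (m-1)/h) are just illustrative choices here, not anything prescribed by the paper.

```python
import hashlib

M = 256  # how many of the smallest hashes to keep; an arbitrary illustrative choice

def hash01(x):
    """Hash an element to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(repr(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class BottomM:
    """Keep the m smallest distinct hashes; estimate n from the m-th smallest."""

    def __init__(self, m=M):
        self.m = m
        self.smallest = set()  # the m smallest distinct hash values seen so far

    def add(self, x):
        h = hash01(x)
        if h in self.smallest:
            return                                   # already counted
        if len(self.smallest) < self.m:
            self.smallest.add(h)
        elif h < max(self.smallest):                 # O(m) scan, for clarity not speed
            self.smallest.remove(max(self.smallest))
            self.smallest.add(h)

    def estimate(self):
        if len(self.smallest) < self.m:
            return len(self.smallest)                # exact below m distinct elements
        # The m-th smallest of n uniform [0,1) values sits near m/(n+1),
        # so n is roughly m divided by that value.
        return self.m / max(self.smallest)
```

With m in the hundreds, the relative error of this kind of estimator is typically on the order of 1/sqrt(m).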
Now here is an adaptation to handle some deletions. Keep the smallest 2m hash values. Use the size of the m'th smallest to estimate the cardinality of the whole set. If an element is deleted, just remove its hash from the set. As long as your set doesn't shrink by around a factor of 2, this should work pretty well.
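A sketch of that variant, reusing hash01 and M from the sketch above (again, only one way to set it up):

```python
import bisect

class BottomMWithDeletes:
    """Keep the 2m smallest hashes; estimate from the m-th smallest."""

    def __init__(self, m=M):
        self.m = m
        self.values = []                 # sorted list of at most 2m hashes

    def add(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.values, h)
        if i < len(self.values) and self.values[i] == h:
            return                       # duplicate element
        if len(self.values) < 2 * self.m:
            self.values.insert(i, h)
        elif h < self.values[-1]:
            self.values.insert(i, h)
            self.values.pop()            # evict the largest to stay at 2m

    def delete(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.values, h)
        if i < len(self.values) and self.values[i] == h:
            del self.values[i]           # only matters if the hash was in the window

    def estimate(self):
        if len(self.values) < self.m:
            return len(self.values)
        return self.m / self.values[self.m - 1]   # m-th smallest hash
```

The extra m values are exactly the factor-of-2 cushion: deletions eat into the window from within, and the estimate stays sound as long as at least m values survive.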
And if you need to handle more? Add the idea of "ghost" elements. When you delete a hash value from the set, add a "ghost" hash value where the 2m+1'th smallest hash value would be expected to be. When you delete a real hash value, each ghost has a random chance of being deleted along with it, matching the fraction of real elements that got deleted. If a ghost is deleted, you insert more. If you insert enough real values that a ghost gets too big to be among the smallest 2m, you let it fall off like any other value.
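The description leaves a couple of details open, namely exactly where a ghost is placed and how the deletion fraction is tracked, so the following sketch is only one possible reading, reusing hash01 and M from above. The concrete choices are assumptions: a ghost is extrapolated just past the current maximum of the window, and on each delete every ghost dies with probability roughly 1/(current estimate).

```python
import bisect
import random

class BottomMWithGhosts:
    """Bottom-2m estimator where deletions are backfilled with ghost hashes."""

    def __init__(self, m=M):
        self.m = m
        self.reals = []                  # sorted hashes of stored live elements
        self.ghosts = []                 # sorted ghost hashes

    def _trim(self):
        # Keep only the 2m smallest values across reals and ghosts combined.
        while len(self.reals) + len(self.ghosts) > 2 * self.m:
            if self.ghosts and (not self.reals or self.ghosts[-1] > self.reals[-1]):
                self.ghosts.pop()        # a ghost fell off the window
            else:
                self.reals.pop()

    def _spawn_ghost(self):
        # Extrapolate where the next-larger order statistic would be expected:
        # k values spread evenly below `top` put the (k+1)-th near top*(k+1)/k.
        k = len(self.reals) + len(self.ghosts)
        if k == 0:
            return
        top = max(self.reals[-1] if self.reals else 0.0,
                  self.ghosts[-1] if self.ghosts else 0.0)
        g = top * (k + 1) / k
        if g < 1.0:
            bisect.insort(self.ghosts, g)

    def add(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.reals, h)
        if i < len(self.reals) and self.reals[i] == h:
            return
        self.reals.insert(i, h)
        self._trim()

    def delete(self, x):
        h = hash01(x)
        i = bisect.bisect_left(self.reals, h)
        if i < len(self.reals) and self.reals[i] == h:
            del self.reals[i]
            self._spawn_ghost()          # backfill the window for the lost value
        # One delete removes roughly a 1/n fraction of live elements; use the
        # current estimate as a stand-in for n, kill each ghost with that
        # probability, and replace every dead ghost with a fresh one further out.
        p = 1.0 / max(self.estimate(), 1.0)
        deaths = sum(random.random() < p for _ in self.ghosts)
        for _ in range(deaths):
            self.ghosts.pop(random.randrange(len(self.ghosts)))
            self._spawn_ghost()
        self._trim()

    def estimate(self):
        merged = sorted(self.reals + self.ghosts)
        if len(merged) < self.m:
            return len(merged)
        return self.m / merged[self.m - 1]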
The resulting algorithm needs more memory, but it handles both adds and deletes. It should be reasonably accurate even if a large fraction of your data gets deleted.