Questions tagged [hyperloglog]

HyperLogLog is a probabilistic technique for estimating the number of distinct entries in a set. It is implemented in Algebird, a Scala library for abstract algebra, and can be used in Summingbird to create MapReduce programs that estimate the cardinality of large datasets in streaming (online) or batch (offline) mode. Redis, the data structure store, also has a HyperLogLog implementation.
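
As a rough illustration of the technique described above, here is a minimal Python sketch. It is not the Algebird or Redis implementation: the class name `HyperLogLog`, the precision parameter `p`, and the choice of SHA-1 as the hash are assumptions made only for this example. It hashes each item, uses the first `p` bits to pick a register, stores the largest "position of the first 1-bit" seen in the remaining bits, and combines the registers with a harmonic mean.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not tuned)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        width = 64 - self.p
        idx = x >> width                # first p bits select a register
        rest = x & ((1 << width) - 1)   # remaining bits
        # 1-based position of the leftmost 1-bit in `rest`
        # (width + 1 when `rest` is zero).
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

With `p=12` the sketch keeps 4,096 small registers regardless of input size, and the expected relative error is about 1.04 / sqrt(4096), or roughly 1.6%. Adding the same item twice never changes the registers, which is why duplicates do not inflate the estimate.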

89 questions
204
votes
3 answers

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list. This was particularly…
K2xL
  • 9,730
  • 18
  • 64
  • 101
59
votes
6 answers

LogLog and HyperLogLog algorithms for counting of large cardinalities

Where can I find a valid implementation of LogLog algorithm? Have tried to implement it by myself but my draft implementation yields strange results. Here it is: function LogLog(max_error, max_count) { function log2(x) { return…
actual
  • 2,370
  • 1
  • 21
  • 32
14
votes
2 answers

Applying HyperLogLog to a sample of the population

The HyperLogLog algorithm by Flajolet et al. describes a clever way to estimate the cardinality of a set using only a tiny amount of memory. However, it does take into account all N elements of the original set in the calculation. What if we had…
Jon Smark
  • 2,528
  • 24
  • 31
10
votes
2 answers

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use…
Julian
  • 1,406
  • 1
  • 13
  • 24
7
votes
1 answer

Is it possible to decrement a HyperLogLog set in Redis?

Let's say that I have a HyperLogLog in Redis which counts messages. Is there any provision whereby I can, to some degree, account for deleted messages?
Jack
  • 1,901
  • 1
  • 19
  • 32
5
votes
2 answers

Is it possible to dedup a HyperLogLog such that adding and deleting an element will yield a relatively correct unique count?

If I want to get the unique count in a list of elements that can be added and deleted, is there a way to do that? For example, add key1, delete key1, add key1 should give a unique count of 1, but if I have a naive method of 2 hll one for delete and one…
Jal
  • 2,174
  • 1
  • 18
  • 37
5
votes
1 answer

Reliable integration test for code using HyperLogLog?

We are using Twitter's implementation of HyperLogLog in Algebird. Given a number N and a check in our system that uses HyperLogLog to estimate the current size of a gradually-growing collection and test if it is more or less than N, how can we write…
Robin Green
  • 32,079
  • 16
  • 104
  • 187
4
votes
1 answer

Is it possible in ClickHouse to store a HyperLogLog / uniqState() state directly through an insert query?

We can use an AggregatingMergeTree table engine, which can be used for aggregating rows. Generally in aggregated data we are not interested in storing all unique identifiers and still want to do a count distinct. Still we want to have the ability…
RoyB
  • 3,104
  • 1
  • 16
  • 37
4
votes
3 answers

Fast way to estimate item counts above a given threshold? Probabilistic data structure?

I have a large list of values, drawn from the range 0 to 100,000 (represented here as letters for clarity). There might be a few thousand items in each input. [a a a a b b b b c f d b c f ... ] I want to find the count of numbers with counts over…
Joe
  • 46,419
  • 33
  • 155
  • 245
4
votes
2 answers

Efficient distributed counting

I have a series of events flowing through a system (e.g., a pizza ordering system) and I want to count certain properties of each event through time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5…
Sam
  • 1,246
  • 1
  • 19
  • 27
3
votes
1 answer

Get Aerospike HyperLogLog (HLL) intersection count of multiple HLL unions

I have 2 or more HLLs that are unioned, and I want to get the intersection count of those unions. I have used the example from here (hll-python example). Following is my code: ops = [hll_ops.hll_get_union(HLL_BIN, records)] _, _, result1 =…
darekarsam
  • 181
  • 2
  • 12
3
votes
1 answer

hyperlog-android: not all logs are sent to the server. How to solve this?

File file = HyperLog.getDeviceLogsInFile(this); HyperLog.setURL("url"); HyperLog.pushLogs(this, file.getAbsolutePath(), false, new HLCallback() { @Override public void onSuccess(@NonNull Object response) { …
3
votes
1 answer

Efficient (constant space or sublinear space) way to find out the cardinality of ((A intersect B) union C) intersect D?

I am currently using HyperLogLog to estimate the cardinality of sets (# of unique items). It's quite trivial to calculate the cardinality for the union of 2 sets and the cardinality for the intersection of 2 sets (|A intersect B| = |A| + |B| - |A…
Jal
  • 2,174
  • 1
  • 18
  • 37
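
The union and inclusion-exclusion steps mentioned in the excerpt above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `sketch`, `union`, `estimate`, and `intersection_estimate` are hypothetical, the registers are a toy HyperLogLog built on SHA-1, and nothing here reflects a specific library's API. The union of two sketches is the element-wise maximum of their registers; the intersection is then estimated as the sum of the two individual estimates minus the union estimate.

```python
import hashlib
import math

P = 12          # precision: 2**P registers (assumed value for the example)
M = 1 << P


def sketch(items):
    """Build toy HyperLogLog registers for an iterable of items."""
    regs = [0] * M
    width = 64 - P
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> width                     # first P bits pick a register
        rest = x & ((1 << width) - 1)        # remaining bits
        regs[idx] = max(regs[idx], width - rest.bit_length() + 1)
    return regs


def union(a, b):
    """Merge two sketches: element-wise maximum of the registers."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    """Cardinality estimate with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


def intersection_estimate(a, b):
    """Inclusion-exclusion: estimate(a) + estimate(b) - estimate(a union b)."""
    return estimate(a) + estimate(b) - estimate(union(a, b))
```

The merge is lossless, so `union(sketch(A), sketch(B))` holds the same registers as a sketch built directly from the combined items. The intersection estimate is much less reliable: its absolute error scales with the size of the union, so a small intersection of two large sets can be swamped by noise, and nesting further unions and intersections compounds the problem.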
3
votes
2 answers

BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

Example relevant table schema: +---------------------------+-------------------+ | activity_date - TIMESTAMP | user_id - STRING | +---------------------------+-------------------+ | 2017-02-22 17:36:08 UTC | fake_id_i24385787…
Ivan
  • 33
  • 6
3
votes
1 answer

How to clear the values of a key in Redis HyperLogLog?

I'm using Redis implementation of HyperLogLog to count distinct values for given keys. The keys are based on hour window. After the calendar hour changes, I want to reset the count of incoming values. I don't see any direct API for 'clearing' up the…
RRM
  • 2,495
  • 29
  • 46