I have a large list of values, drawn from the range 0 to 100,000 (represented here as letters for clarity). There might be a few thousand items in each input.
[a a a a b b b b c f d b c f ... ]
I want to find the count of numbers with counts over certain threshold. For example, if the threshold is 3, the answer is {a: 4, b: 5}
.
The obvious way to do this is to group by identity, count each grouping and then filter.
This is a language agnostic question, but in Clojure (don't be put off if you don't know Clojure!):
(filter (fn [[k cnt]] (> cnt threshold)) (frequencies input))
This function runs over a very large number of inputs, each input is very large, so the grouping and filtering is an expensive operation. I want to find some kind of guard function that will return early if the input can never produce any outputs over the given threshold or otherwise partition the problem space. For example, the most simplistic is if the size of the input is less than the size of the threshold return nil
.
I'm looking for a better guard function that will skip the computation if the input can't produce any outputs. Or a quicker way to produce the output.
Obviously it has to be less expensive than the grouping itself. One great solution involved the count of the input by the distinct set of inputs but that ended up being as expensive as grouping...
I have an idea that probabilistic data structures might hold the key. Any ideas?
(I tagged hyerloglog, although I don't think it applies because it doesn't provide counts)