Calculating a histogram on a streaming data - Online histogram calculation

Question

I am looking for an algorithm to generate a histogram over a large amount of streaming data, the max and min are not known in advance but standard deviation and mean are in a particular range.

I appreciate your ideas.

Cheers,

I meant I don't want to have the exact histogram (the number of elements in each bucket doesn't need to be exact). — Ali Salehi, Jun 18 '11 at 03:07
See also http://stackoverflow.com/questions/2464871/numpy-histogram-of-large-arrays — mtrw, Jun 18 '11 at 05:16
The propose python solution, requires knowing min/max in advance, or one pass through the data, if min/max changes the whole histogram has to be re calculated as bins change, this is not useful for online histogram calculation. Thanks for the suggestion anyway. Please see my answer below. — Ali Salehi, Jun 19 '11 at 01:04

score 6 · Accepted Answer · answered Jun 18 '11 at 04:48

I just found one solution. Sec. 2.2 of "On-line histogram building from A streaming parallel decision tree algorithm" paper. The algo is implemented by NumericHistogram class in Hive project :

A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.

yes. and this is a pot about the paper which explains in simple pictures https://www.vividcortex.com/blog/2013/07/08/streaming-approximate-histograms/ — Felipe, Dec 11 '19 at 15:52

score 2 · Answer 2 · answered Nov 12 '14 at 16:20

I use a package called "GoHistogram" which provides two streaming approximattion histograms (NumericHistogram and Weighted Numeric Histogram). It is implemented in Golang (https://code.google.com). Here is the link:

https://github.com/VividCortex/gohistogram

score 1 · Answer 3 · 2011-06-17T16:00:14.317

Standard deviation and mean do not matter for a histogram. Simply choose your resolution and draw a bar as high as you have hits for its range. This will, of course, get more expensive with a higher resolution. You can try adjusting the resolution by trying to fit the existing data into a normal curve (or whatever model you like) and finding the standard deviation to choose a reasonable granularity.

Edit: Read it wrong the first time around. If you know the approximate standard deviation, you can choose reasonable sizes for your histogram groups from the get-go. Just compare every new entry to your current min and max and adjust your range accordingly.

Calculating a histogram on a streaming data - Online histogram calculation

3 Answers3

Linked