9

I am looking for an algorithm to generate a histogram over a large amount of streaming data, the max and min are not known in advance but standard deviation and mean are in a particular range.

I appreciate your ideas.

Cheers,

Ali Salehi
  • 6,899
  • 11
  • 49
  • 75
  • what's an approximate histogram? – CharlesB Jun 17 '11 at 12:33
  • I meant I don't want to have the exact histogram (the number of elements in each bucket doesn't need to be exact). – Ali Salehi Jun 18 '11 at 03:07
  • See also http://stackoverflow.com/questions/2464871/numpy-histogram-of-large-arrays – mtrw Jun 18 '11 at 05:16
  • The propose python solution, requires knowing min/max in advance, or one pass through the data, if min/max changes the whole histogram has to be re calculated as bins change, this is not useful for online histogram calculation. Thanks for the suggestion anyway. Please see my answer below. – Ali Salehi Jun 19 '11 at 01:04
  • Doesn't this question belong in cs.stackexchange.com ? – einpoklum Dec 14 '13 at 11:58

3 Answers3

6

I just found one solution. Sec. 2.2 of "On-line histogram building from A streaming parallel decision tree algorithm" paper. The algo is implemented by NumericHistogram class in Hive project :

A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.

Ali Salehi
  • 6,899
  • 11
  • 49
  • 75
  • yes. and this is a pot about the paper which explains in simple pictures https://www.vividcortex.com/blog/2013/07/08/streaming-approximate-histograms/ – Felipe Dec 11 '19 at 15:52
2

I use a package called "GoHistogram" which provides two streaming approximattion histograms (NumericHistogram and Weighted Numeric Histogram). It is implemented in Golang (https://code.google.com). Here is the link:

https://github.com/VividCortex/gohistogram

user2077168
  • 109
  • 1
  • 2
1

Standard deviation and mean do not matter for a histogram. Simply choose your resolution and draw a bar as high as you have hits for its range. This will, of course, get more expensive with a higher resolution. You can try adjusting the resolution by trying to fit the existing data into a normal curve (or whatever model you like) and finding the standard deviation to choose a reasonable granularity.

Edit: Read it wrong the first time around. If you know the approximate standard deviation, you can choose reasonable sizes for your histogram groups from the get-go. Just compare every new entry to your current min and max and adjust your range accordingly.