Histogram approximation for streaming data

Question

This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the logarithmic in number of buckets insert/find complexity of a std::set. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.

A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:

initialize a standard C array of size N, where N = number of bins; and
multiply the input value (real number) by some factor and floor the result to get its index in the C array.

This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.

Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product be added to two or subtracted from two as necessary.

This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins would have extremely large counts and others would not. Of course, this is due to non-uniform-sized bins, and doesn't result in any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform-sized bins to make the uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a gaussian, but this will still result in a loss of precision (albeit minimal)

So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?

score 6 · Accepted Answer · answered Dec 03 '11 at 17:11

Keep four variables.

int N;  // assume for simplicity that N is even
int count[N];
double lower_bound;
double bin_size;

When a new sample x arrives, compute double i = floor(x - lower_bound) / bin_size. If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings).

for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
for (int j = N / 2; j < N; j++) count[j] = 0;

The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).

while (lower_bound > x) {
    lower_bound -= N * bin_size;
    bin_size += bin_size;
    for (int j = N - 1; j > N / 2 - 1; j--) count[j] = count[2 * j - N] + count[2 * j - N + 1];
    for (int j = 0; j < N / 2; j++) count[j] = 0;
}

The exceptional cases are expensive but happen only a logarithmic number of times in the range of your data over the initial bin size.

If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.

Histogram approximation for streaming data

1 Answers1