I have been looking for a way to generate bins for a specific dataset (by specifying the lower bound, the upper bound and the number of bins required) using Apache Commons Math 3.0. I have looked at Frequency (http://commons.apache.org/math/apidocs/org/apache/commons/math3/stat/Frequency.html), but it does not give me what I want. I want a method that gives me the frequency of values in an interval (e.g. how many values are between 0 and 5). Any suggestions or ideas?
-
Are you restricted to Apache? This sounds exactly like the use case for [Guava's](http://guava-libraries.googlecode.com) [`SortedMultiset`](http://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained#SortedMultiset). – Louis Wasserman May 28 '12 at 14:51
-
@Louis Wasserman Yes, I'm restricted to Apache Commons Math 3.0, because it provides other fitting and interpolation functionality. – Sami May 28 '12 at 14:59
-
If you're using a more recent version of Java you can do this using the Java Streams API. See [answer](https://stackoverflow.com/a/67979195/2049647) below. – ATutorMe Jun 15 '21 at 02:21
5 Answers
Here is a simple way to implement a histogram using Apache Commons Math 3:

    final int BIN_COUNT = 20;
    double[] data = {1.2, 0.2, 0.333, 1.4, 1.5, 1.2, 1.3, 10.4, 1, 2.0};
    long[] histogram = new long[BIN_COUNT];
    // The bins are equal-width and span the smallest to the largest value in data.
    org.apache.commons.math3.random.EmpiricalDistribution distribution =
            new org.apache.commons.math3.random.EmpiricalDistribution(BIN_COUNT);
    distribution.load(data);
    int k = 0;
    // Each bin keeps its own SummaryStatistics; getN() is the number of values in that bin.
    for (org.apache.commons.math3.stat.descriptive.SummaryStatistics stats : distribution.getBinStats()) {
        histogram[k++] = stats.getN();
    }

-
It is possible to get the interval borders from `EmpiricalDistribution#getUpperBounds` as well. – Till Schäfer Nov 30 '15 at 17:04
-
Does Commons Math provide a function that suggests a "good" number of bins depending on the size of the population you're binning? – L. Blanc Apr 07 '19 at 17:04
-
From the doc USAGE NOTES: The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10. The input file must be a plain text file containing one valid numeric entry per line. See https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/random/EmpiricalDistribution.html – greg Dec 06 '19 at 13:40
As far as I know there is no good histogram class in Apache Commons. I ended up writing my own. If all you want are linearly distributed bins from min to max, then it is quite easy to write.
Maybe something like this:
    public static int[] calcHistogram(double[] data, double min, double max, int numBins) {
        final int[] result = new int[numBins];
        final double binSize = (max - min) / numBins;
        for (double d : data) {
            // Math.floor keeps values just below min out of bin 0;
            // a plain (int) cast truncates toward zero and would count them there.
            int bin = (int) Math.floor((d - min) / binSize);
            if (bin < 0) { /* this data point is smaller than min */ }
            else if (bin >= numBins) { /* this data point is equal to or bigger than max */ }
            else {
                result[bin] += 1;
            }
        }
        return result;
    }
Edit: Here's an example.
    double[] data = { 2, 4, 6, 7, 8, 9 };
    int[] histogram = calcHistogram(data, 0, 10, 4);
    // This is a histogram with 4 bins: 0-2.5, 2.5-5, 5-7.5, 7.5-10.
    assert histogram[0] == 1; // one point (2) in range 0-2.5
    assert histogram[1] == 1; // one point (4) in range 2.5-5
    assert histogram[2] == 2; // two points (6, 7) in range 5-7.5
    assert histogram[3] == 2; // two points (8, 9) in range 7.5-10
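Since the question asks for the frequency of values in each interval, the counts from the example can also be turned into proportions. The following is only a minimal sketch: the class name `RelativeFrequencySketch` and the helper `toRelativeFrequencies` are hypothetical names introduced for illustration and are not part of Apache Commons Math.

```java
import java.util.Arrays;

final class RelativeFrequencySketch {

    // Hypothetical helper: converts raw bin counts into proportions of the sample size.
    static double[] toRelativeFrequencies(int[] counts, int sampleSize) {
        double[] frequencies = new double[counts.length];
        if (sampleSize == 0) {
            return frequencies; // no data: every proportion stays 0.0
        }
        for (int i = 0; i < counts.length; i++) {
            // The cast forces floating-point division; int / int would truncate to 0.
            frequencies[i] = (double) counts[i] / sampleSize;
        }
        return frequencies;
    }

    public static void main(String[] args) {
        int[] counts = {1, 1, 2, 2}; // bin counts for { 2, 4, 6, 7, 8, 9 } with 4 bins on 0-10
        System.out.println(Arrays.toString(toRelativeFrequencies(counts, 6)));
    }
}
```

If every data point falls inside the range, the proportions sum to 1; points outside the range are not counted, so the sum can be smaller.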

-
But how will I get the frequency for each bin? I did not find any class or method that does that in Apache Commons Math 3.0. – Sami May 28 '12 at 15:09
-
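Frequency for each bin? `result[i]` gives you how many data points are in the `i`-th bin. If you want frequency (proportion), simply do `(double) result[i] / data.length`; the cast is needed because both operands are `int` and the division would otherwise truncate to 0. – Max May 28 '12 at 15:20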
-
Max, I think your code has a small bug in it ... see my correction posted below. Thanks. – user1172468 Sep 18 '12 at 02:25
I think your code has a bug in it -- please see the corrected code below:
    public static int[] calcHistogram(double[] data, double min, double max, int numBins) {
        final int[] result = new int[numBins];
        final double binSize = (max - min) / numBins;
        for (double d : data) {
            // changed this from numBins; Math.floor keeps values just below min out of bin 0,
            // because a plain (int) cast truncates toward zero
            int bin = (int) Math.floor((d - min) / binSize);
            if (bin < 0) { /* this data point is smaller than min */ }
            else if (bin >= numBins) { /* this data point is equal to or bigger than max */ }
            else {
                result[bin] += 1;
            }
        }
        return result;
    }
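One edge case worth noting: in the method above, a value exactly equal to `max` maps to bin index `numBins` and is silently dropped. If the last bin should be closed on the right, a variant along the following lines could be used. This is a sketch under that assumption; `InclusiveMaxHistogramSketch` and `calcHistogramInclusive` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

final class InclusiveMaxHistogramSketch {

    // Hypothetical variant: bins are [lo, hi) except the last one, which is [lo, max].
    static int[] calcHistogramInclusive(double[] data, double min, double max, int numBins) {
        if (numBins <= 0 || !(max > min)) {
            throw new IllegalArgumentException("need numBins > 0 and max > min");
        }
        final int[] result = new int[numBins];
        final double binSize = (max - min) / numBins;
        for (double d : data) {
            if (Double.isNaN(d) || d < min || d > max) {
                continue; // outside the requested range (or not a number): skip
            }
            int bin = (int) ((d - min) / binSize);
            if (bin >= numBins) {
                bin = numBins - 1; // d == max, or rounding pushed the index past the last bin
            }
            result[bin]++;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] data = {0, 2, 4, 6, 7, 8, 9, 10, -1, 11};
        // prints [2, 1, 2, 3]: 10 lands in the last bin, -1 and 11 are skipped
        System.out.println(Arrays.toString(calcHistogramInclusive(data, 0, 10, 4)));
    }
}
```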

This is in addition to @Altair7852's answer.
If you want to generate the x values (the bin intervals) for your y values (the frequency in each bin, i.e. `histogram[i]`), here is the full method:
    private fun displayHistogram(binCount: Int, data: DoubleArray) {
        val histogram = DoubleArray(binCount)
        val distribution = org.apache.commons.math3.random.EmpiricalDistribution(binCount)
        distribution.load(data)
        var k = 0
        for (stats in distribution.binStats) {
            histogram[k++] = stats.n.toDouble()
        }
        val minValue = data.min()!!.toDouble()
        val binSize = (data.max()!!.toDouble() - minValue) / binCount
        // Compute the x values once instead of on every loop iteration.
        val xValues = generateHistogramXValues(minValue, histogram.size, binSize)
        for (i in 0 until histogram.size) {
            series2?.appendData(DataPoint(xValues[i], histogram[i]), false, histogram.count())
        }
    }
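Here is the method that generates the x values:

    // The signature line is missing from the post; it is reconstructed here from the call in displayHistogram.
    fun generateHistogramXValues(min: Double, numberOfBIns: Int, binSize: Double): DoubleArray {
        val xValuesArray = DoubleArray(numberOfBIns)
        for (i in 0 until numberOfBIns) {
            if (i == 0) {
                xValuesArray[i] = min
            } else {
                // Each x value is the lower edge of its bin.
                val previous = xValuesArray[i - 1]
                xValuesArray[i] = previous + binSize
            }
        }
        return xValuesArray
    }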
I'm doing this on Android using the GraphView graphing library, but you can use this with any library.
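For readers working in Java rather than Kotlin, the x values can be sketched as below. It assumes equal-width bins between `min` and `max` and computes each lower edge directly as `min + i * binSize` instead of adding `binSize` repeatedly, which avoids accumulating floating-point rounding error over many bins. `BinEdgesSketch` and `lowerEdges` are hypothetical names used only for this illustration.

```java
final class BinEdgesSketch {

    // Hypothetical helper: returns the lower edge of each of numBins equal-width bins on [min, max].
    static double[] lowerEdges(double min, double max, int numBins) {
        if (numBins <= 0) {
            throw new IllegalArgumentException("numBins must be positive");
        }
        final double binSize = (max - min) / numBins;
        final double[] edges = new double[numBins];
        for (int i = 0; i < numBins; i++) {
            edges[i] = min + i * binSize; // computed from min each time, not from the previous edge
        }
        return edges;
    }
}
```

For example, `BinEdgesSketch.lowerEdges(0, 10, 4)` yields the edges 0.0, 2.5, 5.0 and 7.5.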

Here's a Java Streams-based implementation of the same function.
It uses the `range`, `filter` and `count` functions.
    import java.util.Arrays;
    import java.util.stream.IntStream;

    public static Long[] calcHistogram(Double[] data, Double min, Double max, Integer numBins) {
        final var interval = (max - min) / numBins;
        return IntStream.range(0, numBins)
                .boxed()
                .map(n -> {
                    // Each bin is the half-open interval [binStart, binEnd),
                    // so values equal to max are not counted.
                    var binStart = min + n * interval;
                    var binEnd = min + (n + 1) * interval;
                    return Arrays.stream(data).filter(d -> d >= binStart && d < binEnd).count();
                })
                .toArray(Long[]::new);
    }
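The version above scans the whole data array once per bin. Where that matters, a single-pass variant can compute the bin index of each value directly. The following is a sketch under the same half-open `[min, max)` convention; `SinglePassHistogramSketch` and `calcHistogramSinglePass` are hypothetical names, and it takes a primitive `double[]` rather than `Double[]`.

```java
import java.util.Arrays;

final class SinglePassHistogramSketch {

    // Hypothetical single-pass variant: each value is mapped straight to its bin index.
    static long[] calcHistogramSinglePass(double[] data, double min, double max, int numBins) {
        final double interval = (max - min) / numBins;
        final long[] counts = new long[numBins];
        Arrays.stream(data)
                .filter(d -> d >= min && d < max) // keep only values inside [min, max)
                // Math.min guards against rounding pushing a value just below max past the last bin.
                .mapToInt(d -> Math.min((int) ((d - min) / interval), numBins - 1))
                .forEach(bin -> counts[bin]++); // sequential stream, so the shared array is safe here
        return counts;
    }
}
```

The side-effecting `forEach` is only safe because the stream is sequential; a parallel stream would need a different accumulation strategy.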
