
I need to find arbitrary quantiles of a large stream of data (it doesn't fit in memory), and the results need to be repeatable, i.e. for the same stream the results should be identical. I have been using Colt for this, and the results are not repeatable.

Is there another library out there that passes these requirements?

What do I have to do to make the results of quantile binning repeatable with Colt (I'm using 1.2.0)? I've seeded the random number generator that produces my input, but it looks like Colt introduces its own randomness, and I can't figure out where.

I get the following results for two different runs. If they were repeatable, the results would be the same:

[0.0990242124295947, 0.20014652659912247, 0.2996443961549412]
[0.09994965676310263, 0.20079195488768953, 0.29986981667267676]

Here is the code that generates it:

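import java.util.Arrays;
import java.util.Random;

import cern.colt.list.DoubleArrayList;
import hep.aida.bin.QuantileBin1D;

public class QuantileTest {

    public static void main(String[] args) {
        // Last constructor argument is the RandomEngine; null is passed here.
        QuantileBin1D qBins = new QuantileBin1D(false, Long.MAX_VALUE, 0.001, 0.0001, 64, null);
        // The input stream itself is seeded, so it is identical on every run.
        Random rand = new Random(0);
        for (int i = 0; i < 1500000; i++) {
            double num = rand.nextDouble();
            qBins.add(num);
        }

        // Ask for the 10%, 20% and 30% quantiles.
        DoubleArrayList qMarks = new DoubleArrayList(new double[] {0.1, 0.2, 0.3});
        double[] xMarks = qBins.quantiles(qMarks).elements();
        System.out.println(Arrays.toString(xMarks));
    }
}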
fodon
  • You sure it's not just a question of precision? The numbers are so close to being the same that this would be my gut instinct. – Roddy of the Frozen Peas Sep 18 '12 at 16:17
  • Also you've set your epsilon to 0.001. That's the approximation error that will never be exceeded, and it seems like all of your numbers are in fact equal through the 10^-3 digit. If you don't want Colt to use approximation, the docs say to use 0.0 as your epsilon. – Roddy of the Frozen Peas Sep 18 '12 at 16:19
  • If there were no randomness, the results would be identical for identical inputs. The precision quantifies the difference from the true values you would get by actually sorting all the numbers. Repeatability is about getting the same results no matter how many times you run it. – fodon Sep 18 '12 at 17:19

1 Answer


There is still some randomness because you do not supply a RandomEngine to the QuantileBin1D. In some classes (RandomSampler was the first occurrence I found), a default RandomEngine is created when none is given, and that default does not seem to be repeatable:

if (randomGenerator==null) randomGenerator = cern.jet.random.AbstractDistribution.makeDefaultGenerator();
    this.my_RandomGenerator=randomGenerator;

You should change the constructor call to new QuantileBin1D(false, Long.MAX_VALUE, 0.001, 0.0001, 64, new DRand());

Here DRand is cern.jet.random.engine.DRand, whose default constructor is documented as follows:

Constructs and returns a random number generator with a default seed, which is a constant.

With a constant seed, the results should be repeatable from run to run.
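To illustrate why an unseeded internal generator breaks repeatability even when the input stream is seeded, here is a minimal sketch that uses only java.util.Random. It is not Colt's algorithm: the class ReservoirDemo and its sample method are hypothetical names, and plain reservoir sampling stands in for whatever sampling the library does internally. The point is only that any component drawing from its own generator needs that generator seeded too.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in for a sampling-based summary; not part of Colt.
final class ReservoirDemo {

    /** Keeps a uniform random sample of k values from a stream (reservoir sampling). */
    static double[] sample(Random sampler, int streamLength, int k) {
        // Fixed seed: the input stream is identical on every call.
        Random data = new Random(0);
        double[] reservoir = new double[k];
        for (int i = 0; i < streamLength; i++) {
            double value = data.nextDouble();
            if (i < k) {
                // Fill the reservoir with the first k values.
                reservoir[i] = value;
            } else {
                // Replace a random slot with probability k / (i + 1).
                int slot = sampler.nextInt(i + 1);
                if (slot < k) {
                    reservoir[slot] = value;
                }
            }
        }
        Arrays.sort(reservoir);
        return reservoir;
    }

    public static void main(String[] args) {
        // Same seed for the sampler: both runs keep exactly the same values.
        double[] a = sample(new Random(42), 100_000, 5);
        double[] b = sample(new Random(42), 100_000, 5);
        System.out.println("seeded runs equal:   " + Arrays.equals(a, b));

        // Unseeded sampler: each run picks different values from the same stream.
        double[] c = sample(new Random(), 100_000, 5);
        double[] d = sample(new Random(), 100_000, 5);
        System.out.println("unseeded runs equal: " + Arrays.equals(c, d));
    }
}
```

The first comparison is always true because both the data and the sampler are seeded. The second is expected to be false in practice, which mirrors the situation in the question: a seeded input combined with an unseeded internal generator.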

Uwe L. Korn