How is the sample obtained after random number generation?
Depending on the fraction you want to sample, one of two different algorithms is used. See Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?" for details.
it gives me samples of different sizes every time I run it, though it work fine when I set the third parameter (seed). Why so?
If fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:
items.filter { _ => rng.nextDouble() <= fraction }
Otherwise, simplifying things a little, it repeatedly calls the drop method with a random integer and takes the next item.
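To make the drop-based path more concrete, here is a minimal sketch in plain Scala. It is not Spark's GapSamplingIterator; `GapSamplingSketch` and `gapSample` are hypothetical names, and the sketch assumes the gap between accepted elements is drawn from a geometric distribution, so that each element is still kept with probability fraction.

```scala
import scala.util.Random

object GapSamplingSketch {
  // Hypothetical helper, not Spark's implementation: keeps each element with
  // probability `fraction` by skipping geometrically distributed gaps.
  def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnq = math.log1p(-fraction) // log(1 - fraction), always negative here

    new Iterator[T] {
      private var rest = items
      advance() // skip the first gap before anything is returned

      // Draw how many elements to skip before the next accepted one.
      private def advance(): Unit = {
        val u = math.max(rng.nextDouble(), Double.MinPositiveValue) // avoid log(0)
        val gap = (math.log(u) / lnq).toInt
        rest = rest.drop(gap)
      }

      def hasNext: Boolean = rest.hasNext

      def next(): T = {
        val item = rest.next()
        advance()
        item
      }
    }
  }
}
```

For small fractions this needs roughly one random number per accepted element instead of one per input element, which is the usual motivation for gap sampling.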
Keeping that in mind, it should be clear that the number of returned elements is random, with mean equal to fraction * rdd.count (assuming there is nothing wrong with GapSamplingIterator). If you set the seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
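The effect is easy to reproduce without Spark. The sketch below is plain Scala using scala.util.Random; `SampleSizeSketch` and `bernoulliSample` are hypothetical names that only mimic the filter-based path. Since each element is kept independently, the sample size follows a binomial distribution with mean fraction * n and standard deviation sqrt(n * fraction * (1 - fraction)).

```scala
import scala.util.Random

object SampleSizeSketch {
  // Hypothetical stand-in for the filter-based path: each element is kept
  // independently with probability `fraction`.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  def main(args: Array[String]): Unit = {
    val data = 1 to 1000

    // Unseeded generators: sizes fluctuate around fraction * data.size = 500.
    val sizes = Seq.fill(5)(bernoulliSample(data, 0.5, new Random()).size)
    println(s"unseeded sizes: $sizes")

    // Same seed: same random sequence, hence the same elements.
    val a = bernoulliSample(data, 0.5, new Random(42L))
    val b = bernoulliSample(data, 0.5, new Random(42L))
    println(s"seeded samples identical: ${a == b}")
  }
}
```

With n = 1000 and fraction = 0.5 the standard deviation is about 16, so unseeded runs that differ by a few dozen elements are expected rather than a sign of a bug.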