Spark::KMeans calls takeSample() twice?

Question

I have a lot of data, and I have experimented with partitions of cardinality [20k, 200k+].

I call it like this:

from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

and I see that initRandom() calls takeSample() once.

The takeSample() implementation doesn't seem to call itself or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?

Note: I run more KMeans() calls and they all invoke two takeSample()s, regardless of whether the data is .cache()'d or not.

Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it is always 2.

I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!

I brought this to the mailing list of the Spark devs, so I am updating:

Details of 1st takeSample():

Details of 2nd takeSample():

where one can see that the same code is executed.

score 2 · Accepted Answer · answered Sep 01 '16 at 23:03

As suggested by Shivaram Venkataraman on Spark's mailing list:

I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.

// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
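The retry loop above can be modeled in plain Python to see how one takeSample() call may turn into several sampling passes. This is only a toy sketch, not Spark's implementation: `take_sample_model`, its parameters, and the in-memory list standing in for an RDD are all assumptions made for illustration, and it simply truncates the result to `num` items.

```python
import random


def take_sample_model(data, num, fraction, seed=0):
    """Toy model of the retry loop (hypothetical helper, not a Spark API).

    Returns (sample, passes), where passes counts how many full sampling
    passes over the data were needed to collect at least num items.
    """
    if num > len(data) or fraction <= 0:
        raise ValueError("need num <= len(data) and fraction > 0")

    rand = random.Random(seed)

    def sample_pass(pass_seed):
        # One Bernoulli pass over the data; stands in for sample(...).collect()
        r = random.Random(pass_seed)
        return [x for x in data if r.random() < fraction]

    samples = sample_pass(rand.randint(0, 2**31 - 1))
    passes = 1
    # Keep re-sampling while the pass came back too small, as in the loop above
    while len(samples) < num:
        samples = sample_pass(rand.randint(0, 2**31 - 1))
        passes += 1
    return samples[:num], passes
```

With a fraction well above `num / len(data)` the first pass is almost always enough, which matches the comment that re-sampling "shouldn't happen often"; a fraction close to `num / len(data)` makes extra passes likely.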

However, as one can see, the second comment says this shouldn't happen often, yet it happens every time for me, so if anyone has another idea, please let me know.
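One way to check whether the re-sampling path really fires on every run is to look for the warning text in the logs, as the mailing-list reply suggests. The sketch below counts matching lines; the helper name and the log file path in the usage note are hypothetical, and only the warning text is taken from the Scala snippet above.

```python
# Warning text copied from the logWarning call in the snippet above
RESAMPLE_WARNING = "Needed to re-sample due to insufficient sample size"


def count_resample_warnings(lines):
    """Count log lines that contain the re-sampling warning (hypothetical helper)."""
    return sum(1 for line in lines if RESAMPLE_WARNING in line)
```

For example, `with open("driver.log") as f: print(count_resample_warnings(f))`, where `driver.log` is a placeholder for wherever your driver log is written. A count of zero would suggest the second takeSample() entry does not come from this loop.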

It was also suggested that this was a problem of the UI and takeSample() was actually called only once, but that was just hot air.
