How does Spark keep track of the splits in randomSplit?

Question

This question explains how Spark's random split works: "How does Spark's RDD.randomSplit actually split the RDD". But I don't understand how Spark keeps track of which values went to one split, so that those same values don't also go to the second split.

If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly sort each input partition to make the
  // ordering deterministic.

  val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
  }.toArray
}

We can see that it creates one DataFrame per weight (two, for a two-way split), all sharing the same sqlContext, each with a different Sample operator.

How are these two DataFrame(s) communicating with each other so that a value that fell in the first one is not included in the second one?

And is the data being fetched twice? (Assume the sqlContext is selecting from a DB, is the select being executed twice?).

Kien Truong · Accepted Answer · 2016-07-18T15:21:27.567

It's exactly the same as sampling an RDD.

Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each range (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
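To make the range computation concrete, here is a minimal sketch in plain Scala (no Spark required) that mirrors the normalization step from the snippet in the question. The object name `SplitRanges` and the method `ranges` are made up for illustration and are not part of Spark.

```scala
object SplitRanges {
  // Hypothetical helper: turns weights into consecutive (lower, upper) bounds,
  // using the same normalize-then-scanLeft steps as the quoted randomSplit.
  def ranges(weights: Array[Double]): Array[(Double, Double)] = {
    val sum = weights.sum
    // Cumulative normalized weights, starting at 0.0.
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    // Each adjacent pair of cumulative weights is one split's range.
    normalizedCumWeights.sliding(2).map(x => (x(0), x(1))).toArray
  }
}
```

Calling `SplitRanges.ranges(Array(0.6, 0.2, 0.2))` should give three ranges that are, up to floating-point rounding, the ones listed above.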

When it's time to read a resulting DataFrame, Spark just goes over the parent DataFrame. For each item it generates a random number; if that number falls in the specified range, it emits the item. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic: each item gets the same number in every child, and because the ranges do not overlap, it lands in exactly one of them.
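The idea can be modeled in plain Scala without Spark. This is only a sketch of the principle, not Spark's actual sampler (which works per partition and uses its own generator). `SplitSketch`, `sample`, and `randomSplit` are hypothetical names, and `scala.util.Random` stands in for Spark's generator. It assumes the items are a strict collection visited in the same order on every pass, which is the role the per-partition sort plays in the real implementation.

```scala
import scala.util.Random

object SplitSketch {
  // Keep the items whose random draw falls in [lower, upper).
  // A fresh generator with the same seed replays the same sequence of draws,
  // so the i-th item sees the same number on every call.
  def sample[T](items: Seq[T], lower: Double, upper: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val x = rng.nextDouble()
      x >= lower && x < upper
    }
  }

  // One independent pass over the items per weight; the passes never communicate.
  def randomSplit[T](items: Seq[T], weights: Seq[Double], seed: Long): Vector[Seq[T]] = {
    val sum = weights.sum
    val bounds = weights.map(_ / sum).scanLeft(0.0)(_ + _)
    bounds.sliding(2).map(b => sample(items, b(0), b(1), seed)).toVector
  }
}
```

Each split is computed on its own, yet the splits do not overlap, because every pass regenerates the same number for the same item and the ranges are disjoint. If the item order changed between passes, that guarantee would be lost.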

For your last question: if you did not cache the parent DataFrame, the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
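A minimal sketch of caching the parent before splitting, assuming a Spark 2.x or later build with `SparkSession` on the classpath (the question's snippet predates it and uses `sqlContext`). The object name, app name, and the `spark.range` stand-in for the real source are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object CachedSplit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: local run, for illustration
      .appName("cached-split")
      .getOrCreate()

    // Stand-in for a DataFrame read from a database or other external source.
    val parent = spark.range(0, 1000).toDF("id").cache()

    val Array(train, test) = parent.randomSplit(Array(0.8, 0.2), seed = 42L)

    // The first action materializes the cache; later actions can reuse it
    // instead of going back to the source.
    println(train.count())
    println(test.count())

    parent.unpersist()
    spark.stop()
  }
}
```

Note that `cache()` is lazy and best effort: the data is stored when the first action runs, and partitions evicted from memory would be recomputed from the source.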

I would emphasize that the whole trick is to use the same seed for each `Sample`. — zero323, Jul 14 '16 at 17:14
