Spark - does sampleByKeyExact duplicate the data?

Question

When calling sampleByKeyExact on a JavaPairRDD, does Spark store an actual copy of the data, or only pointers into the original JavaPairRDD?

That is, if I draw 100 bootstrap samples from the original dataset, does it keep 100 copies of the original RDD, or 100 different sets of indices pointing into it?

UPDATE:

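// Assumes imports: java.util.HashMap, java.util.Map, org.apache.spark.api.java.JavaPairRDD
JavaPairRDD<String, String> dataPairs = ... // Load the data
boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
// Sampling fraction per key (class label).
// Typed for the Spark 2.x signature; older 1.x releases may expect Map<String, Object>.
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

// Mark the source RDD for caching so each iteration can reuse it
dataPairs.cache();

for (int i = 0; i < 100; i++)
{
    PredictionAlgorithm algo = new Algo();

    // Vary the seed per iteration; reusing one seed would repeat the same sample
    JavaPairRDD<String, String> trainStratifiedData = dataPairs.sampleByKeyExact(withReplacement, classFractions, seed + i);

    algo.fit(trainStratifiedData);

}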

An RDD doesn't contain data, so it is neither. What happens when a task is executed depends on the context. If the data is fetched from cache, the samples can be references to the same objects. If the RDD is recomputed, they are not. Typically, different samples won't be materialized at the same time. — zero323, Feb 15 '16 at 15:37
An RDD is just a recipe. If it is not explicitly or implicitly persisted, it will be recomputed from scratch every time you call an action which depends on it. — zero323, Feb 15 '16 at 16:08
@zero323 Thank you. Just to make sure I understand, I edited the question with a code example. What exactly will happen for every cache() in the loop? Is `trainStratifiedData` a reference to `dataPairs` or a copy? — Serendipity, Feb 15 '16 at 16:46
The problem is not what will happen with `trainStratifiedData`, but with `dataPairs`. Every time you call an action (`algo.fit`), it can be loaded from scratch (assuming no [implicit caches](http://stackoverflow.com/a/34581152/1560062) or explicit caches). — zero323, Feb 15 '16 at 23:09
@zero323 So instead of `trainStratifiedData.cache();` I need to call `dataPairs.cache();` before executing any operation on any of the samples created? — Serendipity, Feb 16 '16 at 07:59
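
To make the "RDD is just a recipe" point from the comments concrete, here is a minimal plain-Java analogy. It does not use Spark at all: `LazyRecipeSketch`, `loadSource`, `cached`, `sample` and `loadsFor` are hypothetical names invented for this sketch, and the toy `sample` is a simple Bernoulli filter rather than what `sampleByKeyExact` does. It only illustrates that a sample is a new recipe on top of its parent, and that without caching the parent is rebuilt for every sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

/** Toy stand-in for an RDD: a recipe that produces data on demand (NOT Spark's API). */
public class LazyRecipeSketch {

    /** Counts how often the source is loaded, to make recomputation visible. */
    static int loadCount = 0;

    /** Hypothetical expensive load step (stands in for reading the input data). */
    static List<Integer> loadSource() {
        loadCount++;
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        return data;
    }

    /** Wraps a recipe so its result is computed on first use and then reused. */
    static <T> Supplier<T> cached(Supplier<T> recipe) {
        return new Supplier<T>() {
            private T value;
            private boolean computed = false;

            @Override
            public T get() {
                if (!computed) {
                    value = recipe.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    /** A sample is a new recipe on top of its parent; nothing is built until get(). */
    static Supplier<List<Integer>> sample(Supplier<List<Integer>> parent, double fraction, long seed) {
        return () -> {
            Random rng = new Random(seed);
            List<Integer> out = new ArrayList<>();
            for (Integer x : parent.get()) {
                if (rng.nextDouble() < fraction) {
                    out.add(x);
                }
            }
            return out;
        };
    }

    /** Evaluates n samples one after another; returns how often the source was loaded. */
    static int loadsFor(Supplier<List<Integer>> source, int n) {
        loadCount = 0;
        for (int i = 0; i < n; i++) {
            // Calling get() plays the role of an action; the result is discarded afterwards.
            sample(source, 0.8, i).get();
        }
        return loadCount;
    }

    public static void main(String[] args) {
        System.out.println("uncached loads: " + loadsFor(LazyRecipeSketch::loadSource, 100)); // prints 100
        System.out.println("cached loads:   " + loadsFor(cached(LazyRecipeSketch::loadSource), 100)); // prints 1
    }
}
```

In this toy model each sample exists only while it is being consumed, so 100 samples never sit in memory together. The only thing worth keeping around is the parent, which matches the suggestion to call `dataPairs.cache()` rather than caching each sample.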
