When I call `sampleByKeyExact` on a `JavaPairRDD`, does Spark store an actual copy of the sampled data, or only pointers into the original `JavaPairRDD`?

In other words, if I draw 100 bootstrap samples from the original dataset, does Spark keep 100 copies of the original RDD, or 100 different sets of indices that point into it?

UPDATE:

JavaPairRDD<String, String> dataPairs = ... // Load the data
boolean withReplacement = true; 
double testFraction =  0.2;
long seed = 0;
Map classFractions = new HashMap(); 
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();    

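for (int i = 0; i < 100; i++)
{
    PredictionAlgorithm algo = new Algo();

    // seed + i: reusing one fixed seed would request the same sample in every iteration
    JavaPairRDD<String, String> trainStratifiedData = dataPairs.sampleByKeyExact(withReplacement, classFractions, seed + i);

    algo.fit(trainStratifiedData);

}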
Serendipity
  • An RDD doesn't contain data, so it is neither one. What happens when a task is executed depends on the context. If the data is fetched from cache, the samples can be references to the same objects. If the RDD is recomputed, they are not. Typically, different samples won't be materialized at the same time. – zero323 Feb 15 '16 at 15:37
  • @zero323 what do you mean by "if RDD is recomputed"? – Serendipity Feb 15 '16 at 16:01
  • An RDD is just a recipe. If it is not explicitly or implicitly persisted, it will be recomputed from scratch every time you call an action that depends on it. – zero323 Feb 15 '16 at 16:08
  • @zero323 Thank you. Just to make sure I understand, I edited the question to add a code example. What exactly will happen for every `cache()` in the loop? Is `trainStratifiedData` a reference to `dataPairs` or a copy? – Serendipity Feb 15 '16 at 16:46
  • The problem is not what happens with `trainStratifiedData`, but what happens with `dataPairs`. Every time you call an action (`algo.fit`), it can be loaded from scratch (assuming no [implicit caches](http://stackoverflow.com/a/34581152/1560062) or explicit caches). – zero323 Feb 15 '16 at 23:09
  • @zero323 so instead of `trainStratifiedData.cache();` I need to call `dataPairs.cache();` before executing any operation on any of the samples created? – Serendipity Feb 16 '16 at 07:59
  • It is probably a good idea. – zero323 Feb 16 '16 at 13:41
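
The "RDD is just a recipe" point from the comments can be illustrated with a small toy model in plain Java. This is only a sketch and does not use Spark: `Recipe`, `sample`, `collect`, and `countLoads` are hypothetical names invented for the illustration, and the real scheduler, partitioning, and cache eviction are ignored. It counts how often the source data is "loaded" across 100 sampling rounds, with and without caching the parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Toy stand-in for a lazily evaluated dataset. This is NOT Spark's RDD API. */
class LazyRecipeDemo {

    /** A "recipe": it stores only the function that produces the data. */
    static final class Recipe<T> {
        private final Supplier<List<T>> compute;
        private boolean cacheRequested;
        private List<T> cached; // filled by the first collect() after cache()

        Recipe(Supplier<List<T>> compute) {
            this.compute = compute;
        }

        /** Marks the recipe for caching; nothing is computed here. */
        Recipe<T> cache() {
            cacheRequested = true;
            return this;
        }

        /** Derives a new, still lazy recipe that samples with replacement. */
        Recipe<T> sample(int n, long seed) {
            return new Recipe<>(() -> {
                List<T> parent = collect(); // evaluates (or reuses) the parent
                Random rng = new Random(seed);
                List<T> out = new ArrayList<>(n);
                for (int i = 0; i < n; i++) {
                    out.add(parent.get(rng.nextInt(parent.size())));
                }
                return out;
            });
        }

        /** The "action": runs the recipe, or returns the cached result. */
        List<T> collect() {
            if (cached != null) {
                return cached;
            }
            List<T> result = compute.get();
            if (cacheRequested) {
                cached = result;
            }
            return result;
        }
    }

    /** Runs `rounds` sampling rounds and returns how often the source was loaded. */
    public static int countLoads(boolean useCache, int rounds) {
        AtomicInteger loads = new AtomicInteger();
        Recipe<String> data = new Recipe<>(() -> {
            loads.incrementAndGet(); // stands in for reading the input
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                rows.add("row-" + i);
            }
            return rows;
        });
        if (useCache) {
            data.cache();
        }
        for (int i = 0; i < rounds; i++) {
            data.sample(8, i).collect(); // a different seed per round
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loads without cache: " + countLoads(false, 100));
        System.out.println("loads with cache:    " + countLoads(true, 100));
    }
}
```

In this toy model the source is loaded 100 times without `cache()` and once with it. Each sample exists only while its own `collect()` result is in use, and its elements are references to the parent's objects rather than copies. Whether the same holds on a real cluster depends on the storage level and serialization, as the first comment notes.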
