I am training an org.apache.spark.mllib.recommendation.ALS model on a fairly large RDD rdd. I'd like to select a decent regularization hyperparameter so that my model doesn't over- (or under-) fit. To do so, I split rdd (using randomSplit) into a train set and a test set and perform cross-validation over a defined set of hyperparameters on them.
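For reference, the cross-validation loop I have in mind looks roughly like this (just a sketch: the rank, iteration count and candidate lambdas are placeholder values, and train_set / test_set are the splits defined in the options below):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    // RMSE of a model on a held-out RDD[Rating]
    def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
      val predictions = model
        .predict(data.map { case Rating(u, p, _) => (u, p) })
        .map { case Rating(u, p, r) => ((u, p), r) }
      val ratesAndPreds = data
        .map { case Rating(u, p, r) => ((u, p), r) }
        .join(predictions)
      math.sqrt(ratesAndPreds.map { case (_, (r, pred)) => (r - pred) * (r - pred) }.mean())
    }

    val rank = 10                       // placeholder
    val numIterations = 10              // placeholder
    val lambdas = Seq(0.01, 0.1, 1.0)   // candidate regularization parameters

    // Every candidate lambda reuses train_set and test_set, hence the caching question.
    val scores = lambdas.map { lambda =>
      (lambda, rmse(ALS.train(train_set, rank, numIterations, lambda), test_set))
    }
    val bestLambda = scores.minBy(_._2)._1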
As I'm using the train and test RDDs several times during cross-validation, it seems natural to cache() the data at some point for faster computation. However, my Spark knowledge is quite limited and I'm wondering which of these two options is better (and why):
1. Cache the initial RDD rdd before splitting it, that is:

    val train_proportion = 0.75
    val seed = 42
    rdd.cache()
    val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
    val train_set = split(0)
    val test_set = split(1)
2. Cache the train and test RDDs after splitting the initial RDD:

    val train_proportion = 0.75
    val seed = 42
    val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
    val train_set = split(0).cache()
    val test_set = split(1).cache()
My speculation is that option 1 is better, because randomSplit would also benefit from the fact that rdd is cached, but I'm not sure whether it would negatively impact the (multiple) future accesses to train_set and test_set compared to option 2.
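For what it's worth, the lineage and storage level of the RDDs involved can be inspected to see what each option actually marks for caching (a small sketch reusing the variables above; note that caching is lazy, so nothing is materialized until an action runs):

    // toDebugString prints the lineage of train_set back to rdd;
    // getStorageLevel shows which RDDs are marked for caching under each option.
    println(train_set.toDebugString)
    println(rdd.getStorageLevel)        // MEMORY_ONLY under option 1, NONE under option 2
    println(train_set.getStorageLevel)  // NONE under option 1, MEMORY_ONLY under option 2

    // Force materialization so that subsequent passes hit the cache.
    train_set.count()
    test_set.count()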
This answer seems to confirm my intuition, but it received no feedback, so I'd like to be sure by asking here.
What do you think? And more importantly: Why?
Please note that I have run the experiment on a Spark cluster, but it is often busy these days, so my conclusions may be wrong. I also checked the Spark documentation and found no answer to my question.