I am training an org.apache.spark.mllib.recommendation.ALS model on a fairly large RDD rdd. I'd like to select a decent regularization hyperparameter so that my model doesn't over- (or under-) fit. To do so, I split rdd (using randomSplit) into a train set and a test set and perform cross-validation on them over a defined set of hyperparameters.

As I'm using the train and test RDDs several times in the cross-validation, it seems natural to cache() the data at some point for faster computation. However, my Spark knowledge is quite limited and I'm wondering which of these two options is better (and why):

  1. Cache the initial RDD rdd before splitting it, that is:

    val train_proportion = 0.75
    val seed = 42
    rdd.cache()
    val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
    val train_set = split(0)
    val test_set = split(1)
    
  2. Cache the train and test RDDs after splitting the initial RDD:

    val train_proportion = 0.75
    val seed = 42
    val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
    val train_set = split(0).cache()
    val test_set = split(1).cache()
    

My speculation is that option 1 is better because randomSplit itself would benefit from rdd being cached, but I'm not sure whether it would negatively impact the (multiple) future accesses to train_set and test_set compared to option 2. This answer seems to confirm my intuition, but it received no feedback, so I'd like to be sure by asking here.

What do you think? And more importantly: Why?

Please note that I have run the experiment on a Spark cluster, but it is often busy these days so my conclusions may be wrong. I also checked the Spark documentation and found no answer to my question.
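
For reference, here is roughly what the cross-validation loop looks like. The rank, the iteration count, the lambda grid and the MSE computation in the sketch below are placeholders rather than my exact code:

    import org.apache.spark.mllib.recommendation.ALS

    // Placeholder hyperparameters -- the real grid is larger.
    val lambdas = Seq(0.01, 0.1, 1.0)
    val rank = 10
    val numIterations = 10

    val results = lambdas.map { lambda =>
      // Train ALS on the train set with the current regularization parameter.
      val model = ALS.train(train_set, rank, numIterations, lambda)

      // Predict ratings for the (user, product) pairs of the test set.
      val userProducts = test_set.map(r => (r.user, r.product))
      val predictions = model.predict(userProducts)
        .map(r => ((r.user, r.product), r.rating))

      // Mean squared error against the actual test ratings.
      val actuals = test_set.map(r => ((r.user, r.product), r.rating))
      val mse = actuals.join(predictions)
        .map { case (_, (actual, pred)) => (actual - pred) * (actual - pred) }
        .mean()

      (lambda, mse)
    }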

  • If `rdd` is not very expensive to create, then caching after the split seems like a better choice. But it is not something you can really judge in isolation without seeing your code or execution stats. ALS, for example, uses intensive caching and checkpointing anyway. – zero323 May 31 '16 at 08:03
  • The important thing is not whether more computation will be done afterwards. You want to know how many times you will execute the lineage from a given point. If you execute a single linear lineage from your training set, apply a heavy sequence of transformations, and never execute a new lineage from the original training set, cache() will not be useful. Since we don't know exactly what you are doing with your training and test sets afterwards, we can't answer the question. Given the cross-validation, it seems to me that you could, as suggested by mark91, cache both before and after. – Pascal Soucy May 31 '16 at 13:49
  • Thank you for your answers. @psoucy: All I'm doing afterwards is, for a set of regularization parameters, training ALS on `train_set` and then predicting the ratings on `test_set`. I also compute the MSE and MAE to evaluate my results and pick the best regularization parameter so I guess I have at least one lineage performed on `test_set` per hyperparameter, which justifies caching it. However, as I don't know well how ALS is implemented, I'm still not sure whether caching `train_set` is a good idea or not. – Alexis Zubiolo Jun 01 '16 at 07:04
  • @AlexisZubiolo yeah, I'm pretty sure that in this case cache() would be useful. If you look at this example http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html they persist() all sets before passing them to train(). – Pascal Soucy Jun 01 '16 at 14:53

1 Answer


If the calculations on the RDD are made before the split, then it is better to cache it beforehand, as (in my experience) all the transformations will be run only once, triggered by the cache() action.
I suppose split() cache() cache() are 3 actions vs. cache() split()'s 2.
EDIT: cache is not an action. And indeed I found confirmation in other similar questions around the web.
EDIT: to clarify my first sentence: the DAG will perform all the transformations on the RDD and then cache it, so everything done to it afterwards will need no further computation, although the split parts will be calculated again.
In conclusion, should you operate heavier transformations on the split parts than on the original RDD itself, you would want to cache them instead. (I hope someone will back me up here.)
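
For completeness, here is a rough sketch of the combined approach discussed in the comments below (cache the parent, cache both splits, materialize them, then release the parent); whether it pays off depends on available memory and on how expensive rdd is to recompute:

    rdd.cache()  // the parent is cached so randomSplit reads it only once

    val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
    val train_set = split(0).cache()
    val test_set  = split(1).cache()

    // Force both splits to be computed (and cached) while the parent is
    // still in memory, then release the parent.
    train_set.count()
    test_set.count()
    rdd.unpersist()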

– Vale
  • cache isn't an action at all! – mgaido May 31 '16 at 09:22
  • T_T kill me now, I have understood nothing these days – Vale May 31 '16 at 09:23
  • @mark91 I see in the documentation that cache isn't listed among the actions. I think my conclusion still stands, though – Vale May 31 '16 at 09:28
  • Actually, maybe the best thing to do is to cache the RDD before the split and then cache the other two RDDs after the split, releasing the initial cached RDD at that point in order to avoid recomputing the filter every time the two splits are used... but it depends on how much memory you have, and maybe the best thing to do is to run some trials. – mgaido May 31 '16 at 09:41
  • Thanks a lot for your feedback. @mark91: I was wondering whether (and why) there is an _optimal_ timing to cache my data, but it seems that it is highly case-dependent. I will run experiments with all the possible options and pick the one with the best empirical performance. – Alexis Zubiolo Jun 01 '16 at 07:20