
My platform is Spark 2.1.0, and I am using Python.

I have about 100 random forest multiclass classification models saved in HDFS, along with 100 datasets that are also saved in HDFS. I want to run predictions on each dataset using its corresponding model. If the models and datasets were cached in memory, I expect prediction would be more than 10 times faster.

But I do not know how to cache the models, because a model is not an RDD or a DataFrame.

Thanks!

molbdnilo
Guanglin Zhou

1 Answer


TL;DR: Just cache the data if it is reused outside the prediction process; if it is not, you can skip even that.

RandomForestModel is a local object that is not backed by distributed data structures. There is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could, the operation would be meaningless.
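
To illustrate the point, here is a minimal sketch of scoring one dataset with its model. It assumes the RDD-based `pyspark.mllib` API, datasets stored in LibSVM format, and a directory layout where `model_i` pairs with `data_i`; the helper names `paired_paths` and `predict_pair` and the path layout are hypothetical, not something from the question. Only the data is cached, and only because it is read twice.

```python
def paired_paths(model_root, data_root, count):
    """Pair model i with dataset i (hypothetical directory layout)."""
    return [
        ("{}/model_{}".format(model_root, i), "{}/data_{}".format(data_root, i))
        for i in range(count)
    ]


def predict_pair(sc, model_path, data_path):
    """Load one model and its dataset; return an RDD of (label, prediction)."""
    # Imported here so the path helper above can be used without Spark installed.
    from pyspark.mllib.tree import RandomForestModel
    from pyspark.mllib.util import MLUtils

    # The model is a plain local object on the driver; there is nothing to cache.
    model = RandomForestModel.load(sc, model_path)

    # Assumption: the datasets are stored in LibSVM format.
    data = MLUtils.loadLibSVMFile(sc, data_path)
    # The data is used twice below (features, then labels), so caching it
    # avoids reading it from HDFS a second time.
    data.cache()

    predictions = model.predict(data.map(lambda point: point.features))
    return data.map(lambda point: point.label).zip(predictions)
```

With an existing SparkContext `sc`, you could loop over `paired_paths("hdfs:///models", "hdfs:///data", 100)` and call `predict_pair(sc, model_path, data_path)` for each pair. If each dataset is only scored once and never reused, the `data.cache()` call can be dropped as well.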

See also: (Why) do we need to call cache or persist on a RDD

Alper t. Turker