How to cache random forest models in Spark

Question

My platform is Spark 2.1.0, using Python.

I have about 100 random forest multiclass classification models saved in HDFS, along with 100 datasets that are also saved in HDFS. I want to run predictions on each dataset using its corresponding model. If the models and datasets are cached in memory, prediction will be more than 10 times faster.

But I do not know how to cache the models, because a model is not an RDD or a DataFrame.

Thanks!

Alper t. Turker · Accepted Answer · 2018-05-29T18:26:21.477

TL;DR: Just cache the data if it is ever reused outside the prediction process; if it is not, you can skip even that.

RandomForestModel is a local object that is not backed by distributed data structures. There is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could be, the operation would be meaningless.
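To make that concrete, here is a minimal sketch of what the loop could look like. It assumes the RDD-based `pyspark.mllib` API and datasets stored in LIBSVM format; the HDFS layout (`models/model_<i>`, `data/dataset_<i>`) and the helper names `model_and_data_paths` and `predict_all` are hypothetical, so adapt them to your setup.

```python
def model_and_data_paths(base_dir, n):
    """Pair each model path with its dataset path (hypothetical layout)."""
    return [
        ("{}/models/model_{}".format(base_dir, i),
         "{}/data/dataset_{}".format(base_dir, i))
        for i in range(n)
    ]


def predict_all(sc, pairs, reuse_data=True):
    """Return {data_path: RDD of (label, prediction)} for each pair."""
    # Imported lazily so the path helper above works without Spark installed.
    from pyspark.mllib.tree import RandomForestModel
    from pyspark.mllib.util import MLUtils

    results = {}
    for model_path, data_path in pairs:
        # The model is a plain object on the driver: just load it,
        # there is nothing to cache.
        model = RandomForestModel.load(sc, model_path)

        # Assumption: the dataset is stored in LIBSVM format.
        data = MLUtils.loadLibSVMFile(sc, data_path)

        # The data is read twice below (features and labels), so marking
        # it for caching can help. cache() is lazy and only takes effect
        # when the first action runs.
        if reuse_data:
            data.cache()

        # In PySpark, predict is called on the RDD itself,
        # not inside a map().
        predictions = model.predict(data.map(lambda lp: lp.features))
        results[data_path] = data.map(lambda lp: lp.label).zip(predictions)
    return results
```

If you only need the predictions and never touch the data again, pass `reuse_data=False` and skip caching entirely.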

See also: (Why) do we need to call cache or persist on a RDD
