
I trained a scikit-learn RandomForestRegressor model on 19 GB of training data. I would like to save it to disk in order to use it later for inference. As has been recommended in other Stack Overflow questions, I tried the following:

  • Pickle
pickle.dump(model, open(filename, 'wb'))

The model was saved successfully. Its size on disk was 1.9 GB.

loaded_model = pickle.load(open(filename, 'rb'))

Loading the model resulted in a MemoryError (despite 16 GB of RAM).

  • cPickle - the same result as Pickle
  • Joblib

joblib.dump(est, 'random_forest.joblib', compress=3)

It also ends with a MemoryError while loading the file.

  • Klepto
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d['sklearn_random_forest'] = est
d.dump()

The archive is created, but when I try to load it using the following code, I get KeyError: 'sklearn_random_forest'

d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d.load(model_params)
est = d[model_params]

I tried saving a dictionary object using the same code, and it worked, so the code is correct. Apparently klepto cannot persist sklearn models. I played with the cached and serialized parameters and it didn't help.

Any hints on how to handle this would be much appreciated. Is it possible to save the model in JSON, XML, maybe HDFS, or maybe other formats?

kedar

2 Answers


Try using joblib.dump()

With this method you can use the compress parameter. It accepts integer values from 0 to 9; the higher the value, the more compressed your file gets. In most cases a compress value of 3 is sufficient.

The only downside is that higher compress values make writing and reading the file slower.
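
For reference, a minimal sketch of the save/load round trip, assuming est is the fitted RandomForestRegressor from the question and joblib is installed:

import joblib

# write the model to disk; compress takes 0-9 (higher = smaller file, slower I/O)
joblib.dump(est, 'random_forest.joblib', compress=3)

# read it back for inference
loaded_est = joblib.load('random_forest.joblib')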

Sahil_Angra
  • Hi. Thank you for your response. I have already tried using joblib.dump() with compress=3. Unfortunately, as with pickle, the object's size in RAM is an order of magnitude larger than its size on disk, so I am getting a MemoryError when loading with joblib. – kedar Jan 22 '21 at 15:15

The size of a Random Forest model is not strictly dependent on the size of the dataset you trained it with. Instead, other parameters, which you can see in the RandomForestRegressor documentation, control how big the model can grow. Parameters like:

  • n_estimators - the number of trees
  • max_depth - how "tall" each tree can get
  • min_samples_split and min_samples_leaf - the minimum number of samples required to split a node / to form a leaf

If you have trained your model with a high number of estimators, large max depth, and very low leaf/split samples, then your resulting model can be huge - and this is where you run into memory problems.

In these cases, I've often found that training smaller models by controlling these parameters (as long as it doesn't kill the performance metrics) resolves this problem, and you can then fall back on joblib or the other solutions you mentioned to save/load your model.
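
As a rough illustration only, where the parameter values are placeholders rather than tuned recommendations and X_train/y_train stand in for your training arrays:

import joblib
from sklearn.ensemble import RandomForestRegressor

# cap the ensemble size and tree depth so the fitted model stays small
est = RandomForestRegressor(
    n_estimators=100,     # fewer trees -> smaller model
    max_depth=20,         # limit how "tall" each tree can get
    min_samples_leaf=5,   # larger leaves -> fewer nodes per tree
    n_jobs=-1,
)
est.fit(X_train, y_train)

# persist the smaller model as before
joblib.dump(est, 'random_forest.joblib', compress=3)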

neal