11

I want to save an sklearn Pipeline, consisting of a custom Preprocessor and a RandomForestClassifier, to disk with all of its dependencies inside the saved file. Without this, I have to copy every dependency (my custom modules) into the same folder wherever I want to load the model (in my case, on a remote server).

The preprocessor is defined in a class that lives in another file (preprocessing.py) in the same folder of my project, so I access it through an import.

training.py

from preprocessing import Preprocessor

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pickle

clf = Pipeline([
("preprocessing", Preprocessor()),
("model", RandomForestClassifier())
])

# some fitting of the classifier
# ...

# Export
with open(savepath, "wb") as handle:
    pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)

I tried pickle (and some of its variations), dill and joblib, but none of them worked. When I load the .pkl somewhere else (say, on my remote server), I must have an identical preprocessing.py available there... which is a pain.

What I would love is to have another file somewhere else:
remote.py

import pickle

with open(savepath, "rb") as handle:
    model = pickle.load(handle)

print(model.predict(some_matrix))

But this code currently gives me an error as it does not find the Preprocessor class...
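To make the failure concrete, here is a minimal sketch of why the load fails, using only the standard library. It assumes a throwaway module named `preprocessing_demo` (hypothetical, standing in for preprocessing.py) and a trivial `Preprocessor` class. The point it illustrates is that pickle stores a class by reference (module name plus class name), not its source code, so the loading side must be able to import that module.

```python
import pickle
import sys
import types

# Build a throwaway module standing in for preprocessing.py.
# The module name and class body are hypothetical, for illustration only.
module = types.ModuleType("preprocessing_demo")
exec(
    "class Preprocessor:\n"
    "    def __init__(self):\n"
    "        self.scale = 2.0\n",
    module.__dict__,
)
sys.modules["preprocessing_demo"] = module

# Pickle an instance, as training.py would do with the whole pipeline.
payload = pickle.dumps(module.Preprocessor())

# The payload names the module and the class, but carries no source code.
print(b"preprocessing_demo" in payload, b"Preprocessor" in payload)

# Simulate the remote server, where the module cannot be imported.
del sys.modules["preprocessing_demo"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as exc:
    load_error = exc

print(type(load_error).__name__)
```

The same applies to joblib, which builds on pickle: the file only helps if the environment that loads it can import the class definition.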

maxJu

2 Answers

3

I'm facing an identical issue right now. To address it, I am going to turn my pipeline/model, along with all its dependencies (the preprocessing classes), into a Python package using setuptools, so that it is self-contained and can be run anywhere (remote server, Docker container, VM).

I'm currently going through this process and if this is something you are interested in, I can respond with the additional steps spelled out as I make progress.
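As a rough idea of what that could look like, here is a minimal setup.py sketch. The package name `my_model_pkg`, the version, and the directory layout are all assumptions made for illustration, not details from this thread.

```python
# setup.py
#
# Assumed (hypothetical) project layout:
#
#   my_model_pkg/
#       __init__.py
#       preprocessing.py   # contains the Preprocessor class
#   setup.py

from setuptools import find_packages, setup

setup(
    name="my-model-pkg",               # hypothetical distribution name
    version="0.1.0",                   # hypothetical version
    packages=find_packages(),          # picks up my_model_pkg/
    install_requires=["scikit-learn"], # needed to unpickle the pipeline
)
```

With something like this in place, training.py would import the class as `from my_model_pkg.preprocessing import Preprocessor`, and the remote machine would install the same package (for example with `pip install .`) before calling `pickle.load`. The pickle file itself still only references the class; the installed package is what makes that reference resolvable.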

gdv820

-1

I am not sure which tools you are using, but MLflow has a feature that addresses this issue: it saves all the dependency files as a package, so that when the model is deployed, it is deployed along with all its dependencies.

Following along with this post should help.
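A minimal sketch of that idea, assuming a recent MLflow version in which `mlflow.sklearn.save_model` accepts a `code_paths` argument (please check the documentation for the version you have installed). The helper names `save_pipeline_with_code` and `load_pipeline` are hypothetical wrappers, not part of MLflow.

```python
def save_pipeline_with_code(pipeline, model_dir, code_files):
    """Save a fitted sklearn pipeline together with the source files it needs.

    `code_files` is a list of local paths, e.g. ["preprocessing.py"].
    Assumes MLflow is installed and supports `code_paths`.
    """
    import mlflow.sklearn  # imported here so this file loads without MLflow

    mlflow.sklearn.save_model(
        sk_model=pipeline,
        path=model_dir,         # target directory for the saved model
        code_paths=code_files,  # copied into the saved model directory
    )


def load_pipeline(model_dir):
    """Load the pipeline back through MLflow's own loader."""
    import mlflow.sklearn

    return mlflow.sklearn.load_model(model_dir)
```

On the remote side, the whole model directory would be copied over and loaded with `load_pipeline(model_dir)` rather than a bare `pickle.load`, since it is MLflow's loader that is expected to make the bundled code importable. MLflow and scikit-learn still need to be installed there.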

Sai krishna