I want to save an sklearn Pipeline (a custom Preprocessor step plus a RandomForestClassifier) to disk with all of its dependencies bundled inside the saved file. Without this, I have to copy all the dependencies (custom modules) into the same folder everywhere I want to call the model (in my case, on a remote server).
The preprocessor is defined in a class that lives in another file (preprocessing.py) in the same folder as my project, so I access it through an import.
training.py
from preprocessing import Preprocessor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pickle
clf = Pipeline([
    ("preprocessing", Preprocessor()),
    ("model", RandomForestClassifier())
])
# some fitting of the classifier
# ...
# Export
with open(savepath, "wb") as handle:
    pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)
I tried pickle (and some of its variants), dill, and joblib, but none of them solved this. When I load the .pkl somewhere else (say, on my remote server), I still need an identical preprocessing.py in place... which is a pain.
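As far as I understand, this happens because pickle serializes custom classes by reference (module path + class name), not by value, so the class definition must be importable at load time. A minimal stdlib-only illustration of what I mean (Widget just stands in for my Preprocessor):

```python
import pickle

class Widget:  # stand-in for the custom Preprocessor class
    pass

data = pickle.dumps(Widget())
# The stream records only the qualified name of the class,
# not its source code, so unpickling re-imports the defining module.
print(b"Widget" in data)
```

This is why the unpickler on the remote machine goes looking for the preprocessing module.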
What I would love is to have another file somewhere else:
remote.py
import pickle
with open(savepath, "rb") as handle:
    model = pickle.load(handle)
print(model.predict(some_matrix))
But this code currently raises an error because it cannot find the Preprocessor class...