
I have a random forest classifier stored in the object clf. In really simplified terms, I did the following:

# Import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as rfc

# Import data
exog = pd.read_csv('train.csv')
trgt = pd.read_csv('target.csv')

# Declare classifier
clf = rfc(n_estimators=51, bootstrap=True, max_features=3)

# Fit classifier to data
clf.fit(exog, trgt)

I would like to export clf so I can reference it in another script. My goal is to load clf in a Python script running on a remote server, feed out-of-sample data into it, and have it return the predicted probabilities via clf.predict_proba(new_data).

My top priority is to avoid training the classifier every time I predict the probabilities for new datasets. Is there a way to export the tuned clf object?

This thread pointed me in the right direction, but the solution uses cPickle and it throws the following error:

TypeError: write() argument must be str, not bytes

Arturo Sbr
  • Does this answer your question? [Save classifier to disk in scikit-learn](https://stackoverflow.com/questions/10592605/save-classifier-to-disk-in-scikit-learn) – Arco Bast Mar 17 '20 at 21:36
  • It does. I was looking for a thread like this on SO but couldn't find it. I may edit the thread to add more tags to it or change the text to make it match with more searches. Additionally, I am confused as to how to save the file to disk. Is there a specific extension that has to be used? Finally, the answer on that thread uses `cPickle`. Is there a difference between the two? – Arturo Sbr Mar 17 '20 at 21:54
  • You do not have to use a specific extension, but you could if you wanted. Some people use '.pickle'. While pickle is pure python, cPickle is a C-extension, which supposedly makes it faster. Actually, I'd try cPickle first. I also do not think that you have to edit your question. – Arco Bast Mar 17 '20 at 22:06
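(Regarding the pickle vs. cPickle discussion in the comments above: on Python 3 there is no separate cPickle module any more; the standard pickle module uses the C implementation automatically whenever it is available. A quick check, assuming CPython:)

import pickle

# Python 3's pickle falls back on the C implementation (_pickle) when available,
# so there is no separate cPickle module to import.
print(pickle.Pickler)  # on CPython this prints <class '_pickle.Pickler'>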

3 Answers


There is plenty about model persistence in the sklearn documentation, but it advises you to use either pickle or joblib.

e.g. joblib

>>> from joblib import dump, load
>>> dump(clf, 'filename.joblib')
>>> clf = load('filename.joblib')

or pickle

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)

From the docs:

In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
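Applied to your setup, a rough sketch (the file and data names below are just placeholders): dump the fitted clf locally, copy the .joblib file to the remote server, and load it there before calling predict_proba.

# Locally, after fitting clf
from joblib import dump
dump(clf, 'clf.joblib')

# On the remote server, after copying clf.joblib over
import pandas as pd
from joblib import load

clf = load('clf.joblib')
new_data = pd.read_csv('new_data.csv')  # placeholder for the out-of-sample data
scores = clf.predict_proba(new_data)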

Adam

This code snippet will work for you:

import pickle
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(clf, open(filename, 'wb'))

# some time later...

# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
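# X_test and Y_test are a held-out test set (not defined in this snippet)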
result = loaded_model.score(X_test, Y_test)
print(result)

The snippet above is from this source.

Your question has a duplicate.
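For the specific goal in the question (probabilities for new data rather than an accuracy score), the loaded model can be used the same way; X_new and new_data.csv below are placeholders for the out-of-sample data:

import pandas as pd
import pickle

# Load the classifier that was pickled above
with open('finalized_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

# The columns of the new data must match the columns used to fit the model
X_new = pd.read_csv('new_data.csv')
probabilities = loaded_model.predict_proba(X_new)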


You can serialize the object with pickle or cloudpickle. This will work as long as the package versions in your remote and local environments are the same (a rough version check is sketched at the end of this answer).

For saving:

import pickle

# Open the file in binary mode ('wb'); pickle writes bytes, which is what
# caused the "write() argument must be str, not bytes" error in the question
with open('/path/to/file', 'wb') as f:
    pickle.dump(clf, f)

For loading:

import pickle

# Read back in binary mode ('rb') to match how the file was written
with open('/path/to/file', 'rb') as f:
    clf = pickle.load(f)
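One rough safeguard against the version mismatch mentioned above is to store the scikit-learn version next to the model and compare the two at load time. This is just a sketch, not part of the original answer:

import pickle
import sklearn

# Save the model together with the scikit-learn version used to train it
with open('/path/to/file', 'wb') as f:
    pickle.dump({'sklearn_version': sklearn.__version__, 'model': clf}, f)

# Later, on the remote server
with open('/path/to/file', 'rb') as f:
    payload = pickle.load(f)

if payload['sklearn_version'] != sklearn.__version__:
    print('Warning: model was trained with scikit-learn', payload['sklearn_version'])

clf = payload['model']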
Arco Bast