scikit-learn suggests the use of pickle for model persistence. However, the docs note the limitations of pickle when it comes to different versions of scikit-learn or Python. (See also this Stack Overflow question.)
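For context, the documented route serializes the whole fitted estimator, roughly like this (a minimal sketch of the pickle approach; joblib.dump works the same way and is suggested for estimators holding large arrays):

import pickle
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)
blob = pickle.dumps(clf)           # serialize the entire estimator
clf_restored = pickle.loads(blob)  # may break across sklearn/Python versions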
In many machine learning approaches, only a few parameters are learned from large data sets. These estimated parameters are stored in attributes with a trailing underscore, e.g. coef_.
Now my question is the following: can model persistence be achieved by persisting the estimated attributes and reassigning them later? Is this approach safe for all estimators in scikit-learn, or are there potential side effects (e.g. private variables that have to be set) in the case of some estimators?
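In other words, something like this hypothetical helper, which relies only on the trailing-underscore convention (fitted_attributes is a name I made up):

def fitted_attributes(estimator):
    # Collect attributes set during fit(), following the scikit-learn
    # convention: trailing underscore, no leading underscore
    return {name: value
            for name, value in vars(estimator).items()
            if name.endswith('_') and not name.startswith('_')}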
It seems to work for logistic regression, as seen in the following example:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4)
# Here we train the logistic regression
lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))  # e.g. 0.95 (varies with the random split)
# Persisting
params = lr.get_params()
coef = lr.coef_
intercept = lr.intercept_
# classes_ is not documented as a public member,
# but it is not explicitly private either (no leading underscore)
classes = lr.classes_
# lr.n_iter_ is just fitting metadata; no need to persist it
# Now we restore the classifier from the persisted values
lr2 = LogisticRegression()
lr2.set_params(**params)
lr2.coef_ = coef
lr2.intercept_ = intercept
lr2.classes_ = classes
print(lr2.score(X_test, y_test))  # prints the same score as above
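To actually persist these values to disk, one option is numpy's savez for the arrays plus JSON for the hyperparameters (a sketch: the file names are made up, and it assumes all hyperparameters are JSON-serializable, which holds for LogisticRegression):

import json
import numpy as np

# Save: fitted arrays via numpy, hyperparameters via JSON
np.savez('lr_attrs.npz', coef=lr.coef_, intercept=lr.intercept_,
         classes=lr.classes_)
with open('lr_params.json', 'w') as f:
    json.dump(lr.get_params(), f)

# Restore into a fresh estimator
with open('lr_params.json') as f:
    lr3 = LogisticRegression(**json.load(f))
attrs = np.load('lr_attrs.npz')
lr3.coef_ = attrs['coef']
lr3.intercept_ = attrs['intercept']
lr3.classes_ = attrs['classes']
print(lr3.score(X_test, y_test))  # same score again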