I want to create a persistent scikit-learn model and reference it later via a hash. Using joblib for serialization, I would expect full (bit-level) integrity as long as my data does not change. But every time I run the code, the model file on disk has a different hash. Why is that, and how can I get a truly identical serialization on every run of the unchanged code? Setting a fixed seed did not help (I am not sure whether sklearn's algorithm uses random numbers at all in this simple example).
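For context, the end goal is content-addressed storage: each model file gets stored under its own checksum so it can be looked up later. A minimal sketch of that idea (the models/ directory and the store_by_hash name are hypothetical, just to illustrate the intended lookup):

import hashlib
import os
import shutil

def store_by_hash(fname, store_dir="models"):
    # hypothetical helper: copy the file to models/<md5>.joblib and return the digest
    with open(fname, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    shutil.copy(fname, os.path.join(store_dir, digest + ".joblib"))
    return digest

This only works if the serialization is deterministic, which brings me to the script below that reproduces the changing hash: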
import numpy as np
from sklearn import linear_model
import joblib
import hashlib
# set a fixed seed …
np.random.seed(1979)
# internal md5sum function
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
# dummy regression data
X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
# create model
reg = linear_model.LinearRegression()
# save model to disk to make it persistent
# (joblib.dump takes the filename directly, no open() needed)
joblib.dump(reg, "reg.joblib")
# load persistent model from disk
model = joblib.load("reg.joblib")
# fit & predict
reg.fit(X, Y)
model.fit(X, Y)
myprediction1 = reg.predict([[2., 2., 0.1, 1.1]])
myprediction2 = model.predict([[2., 2., 0.1, 1.1]])
# run several times … why does the md5sum change every time?
print(md5("reg.joblib"))
print(myprediction1, myprediction2)
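Out of curiosity I also tried hashing the serialized bytes in memory, bypassing the file on disk. A minimal sketch of that comparison (pinning the pickle protocol and using joblib.hash are my own assumptions about what a fair baseline looks like, not something the docs prescribe):

import pickle

# hash the raw pickle bytes of the model object, without going through a file
payload = pickle.dumps(reg, protocol=4)  # pin the protocol so it cannot vary between runs
print(hashlib.md5(payload).hexdigest())

# joblib also ships its own object hasher; compare it against the file checksum
print(joblib.hash(reg, hash_name="md5"))

If those stay stable across runs while md5("reg.joblib") does not, the instability would come from the file layer rather than from the model object itself.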