The code below is completely reproducible when n_jobs=1 is passed to the cross_validate function, but not when n_jobs=-1 or n_jobs=2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold

class DecisionTree(DecisionTreeClassifier):
    def fit(self, X, Y):
        weight = np.random.uniform(size=Y.shape)
        return super().fit(X, Y, sample_weight=weight)

def main():
    X, Y = load_iris(return_X_y=True)
    rks = RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
    clf = DecisionTree(random_state=42)
    res = cross_validate(clf, X, Y, cv=rks, n_jobs=2)['test_score'] * 100
    return res.mean(), res.std()

if __name__ == '__main__':
    np.random.seed(42)
    print(main())
Please note the np.random.uniform call in the fit method. Without such NumPy calls, the code is completely reproducible. It is mentioned here that numpy.random.seed is not thread-safe, but I saw no mention of this in sklearn's FAQ, according to which providing random_state everywhere should suffice.

Is there any way to use both NumPy random calls and multiprocessing in sklearn while maintaining full reproducibility?
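One approach worth trying (my own sketch, not an official sklearn recipe): instead of drawing from NumPy's global random state inside fit, derive a local generator from the estimator's own random_state parameter with sklearn.utils.check_random_state. Each clone that cross_validate creates then carries its own seed, so the draws no longer depend on which worker process (or how many) executes the fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_random_state
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold

class DecisionTree(DecisionTreeClassifier):
    def fit(self, X, Y):
        # check_random_state turns the int seed into a RandomState that is
        # local to this estimator clone -- no shared global NumPy state.
        rng = check_random_state(self.random_state)
        weight = rng.uniform(size=Y.shape)
        return super().fit(X, Y, sample_weight=weight)

def main():
    X, Y = load_iris(return_X_y=True)
    rks = RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
    clf = DecisionTree(random_state=42)
    res = cross_validate(clf, X, Y, cv=rks, n_jobs=2)['test_score'] * 100
    return res.mean(), res.std()

if __name__ == '__main__':
    print(main())
```

Note the trade-off: because every clone shares the same seed, every fold now draws identical sample weights, which may or may not be what you want; the point is only that the result no longer depends on n_jobs or on np.random.seed.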
EDIT: I think it reproduces fine if we instead pass n_jobs>1 to estimators that accept it, e.g. when instantiating RandomForestClassifier.