from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import clone
import multiprocessing
import functools
import numpy as np

def train_model(n_estimators, base_model, X, y):
    # Runs in a worker process: clone the base estimator and fit the copy.
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X, y)
    return model


class A():
    def __init__(self, random_state, jobs, **kwargs):
        self.model = RandomForestClassifier(oob_score=True, random_state=random_state, **kwargs)
        self.jobs = jobs


    def fit(self, X, y):
        # Fit the forest in a worker process via a multiprocessing pool.
        job_pool = multiprocessing.Pool(self.jobs)
        n_estimators = [100]
        for output in job_pool.imap_unordered(functools.partial(train_model,
                                                                base_model=self.model,
                                                                X=X,
                                                                y=y), n_estimators):
            model = output
        job_pool.terminate()
        self.model = model


if __name__ == '__main__':

    np.random.seed(42)
    X, y = make_classification(n_samples=500,n_informative=6,n_redundant=6, flip_y=0.1)

    print "Class A"
    for i in range(5):
        base_model = A(random_state=None, jobs=1)
        base_model.fit(X,y)
        print base_model.model.oob_score_

    print "Bare RF"
    base_model = RandomForestClassifier(n_estimators=500, max_features=2, oob_score=True, random_state=None)
    for i in range(5):
        model = clone(base_model)
        model.fit(X,y)
        print model.oob_score_

Output on a Windows 7 machine (Python 2.7.13):
(pip freeze : numpy==1.11.0, scikit-image==0.12.3, scikit-learn==0.17, scipy==0.17.0)

Class A
0.82
0.826
0.832
0.822
0.816

Bare RF
0.814
0.81
0.818
0.818
0.818

Output on a Red Hat 4.8.3-9 Linux machine (Python 2.7.5):
(pip freeze: numpy==1.11.0, scikit-learn==0.17, scipy==0.17.0, sklearn==0.0)
Class A
0.818
0.818
0.818
0.818
0.818

Bare RF
0.814
0.81
0.818
0.818
0.818

So, to sum up:
On Linux, "Class A" (which uses multiprocessing) appears to train the exact same model every time, hence the identical scores, whereas the behavior I would expect is that of the "Bare RF" section, where the scores differ between runs (random forest training is randomized). On Windows (PyCharm), the issue cannot be reproduced.

Can you please help?

BIG EDIT: Created a reproducible code example.

user152245
  • Even though I can't say I agree with the implementation you have provided, I think you have provided part of the solution to what is happening here :) : `clone(base_model)`. Use this in fit `base_model=clone(self.model)`, before it gets distributed – mkaran May 29 '17 at 10:21
  • To add to @mkaran's comment: the problem is the initialization of the estimator. In your case, you are not re-initializing the model in each fit, which means the weights or coefficients the model learnt in the previous fit stay mostly the same (even if fit() is called again), whereas scikit-learn's clone will re-initialize the model and the weights. – Vivek Kumar May 29 '17 at 10:26
  • @Vivek Kumar can you please elaborate (with some code maybe?). Thank you! – user152245 May 29 '17 at 12:25
  • Can you provide [some data and complete code](https://stackoverflow.com/help/mcve) so that we can copy paste and analyze it? – Vivek Kumar May 29 '17 at 12:37
  • And you said that you tried the `base_model=clone(self.model)` suggestion by mkaran. Can you tell how did you use it? You need to use it inside your `functools.partial()` method. – Vivek Kumar May 29 '17 at 12:39
  • @VivekKumar : yes, I've tried it both in the call of functools.partial and above it (out of the loop). Any clues ? – user152245 Jun 01 '17 at 11:55
  • Out of the loop makes no sense. Are you not getting any difference when calling it in `functools.partial`? – Vivek Kumar Jun 01 '17 at 12:02
  • No, still the same result when I tried in the call like: functools.partial(train_model,base_model=clone(self.model),X=X, y=y) – user152245 Jun 01 '17 at 12:21

1 Answer


The solution is to reseed NumPy's random number generator inside "train_model", the function that is executed in the parallel workers.

def train_model(n_estimators, base_model, X, y):
    np.random.seed()  # reseed this worker from OS entropy
    model = clone(base_model)
    model.set_params(n_estimators=n_estimators)
    model.fit(X, y)
    return model

The reasoning:

On Unix, `multiprocessing` starts its worker processes with `fork`, so every worker inherits the exact same NumPy random number generator state from the parent process. The workers therefore generate identical pseudo-random sequences and train identical forests. Calling `np.random.seed()` with no argument reseeds each worker from operating-system entropy.

It is `multiprocessing` that actually launches the worker processes, which is why the fix belongs there; this is not a scikit-learn `clone` issue. (Windows spawns fresh interpreter processes instead of forking, which is why the problem does not reproduce there.)
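
As a minimal sketch of the underlying behavior (hypothetical code, not from the original post; it assumes a Unix platform where `multiprocessing` forks its workers):

    import multiprocessing
    import numpy as np

    def draw(_):
        # Each forked worker starts from the parent's RNG state, so without
        # a reseed the first draw is typically identical across workers.
        return np.random.randint(0, 1000000)

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)
        print(pool.map(draw, range(4)))  # on Linux: usually four identical values
        pool.close()
        pool.join()

An alternative to reseeding globally would be to pass a distinct `random_state` to each task, e.g. as part of the iterable handed to the pool.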

I've found the answer here and here

user152245