6

So I wish to create a process using the Python multiprocessing module, and I want it to be part of a larger script. (I also want a lot of other things from it, but right now I will settle for this.)

I copied the most basic example from the multiprocessing docs and modified it slightly.

However, everything outside of the if __name__ == '__main__': block gets repeated every time p.join() is called.

This is my code:

from multiprocessing import Process

data = 'The Data'
print(data)

# worker function definition
def f(p_num):
    print('Doing Process: {}'.format(p_num))

print('start of name == main ')

if __name__ == '__main__':
    print('Creating process')
    p = Process(target=f, args=(data,))
    print('Process made')
    p.start()
    print('process started')
    p.join()
    print('process joined')

print('script finished')

This is what I expected:

The Data
start of name == main 
Creating process
Process made
process started
Doing Process: The Data
process joined
script finished

Process finished with exit code 0

This is the reality:

The Data
start of name == main 
Creating process
Process made
process started
The Data                         <- wrongly repeated line
start of name == main            <- wrongly repeated line
script finished                  <- wrongly executed early line
Doing Process: The Data
process joined
script finished

Process finished with exit code 0

I am not sure whether this is caused by the if statement, by p.join(), or by something else, and by extension why it is happening. Can someone please explain what causes this and why?

For clarity, since some people cannot replicate my problem but I can: I am using Windows Server 2012 R2 Datacenter and Python 3.5.3.

Harry de winton
  • When using the multiprocessing module, a whole new Python process is created, meaning the Python script is essentially duplicated. The only difference is that the newly created process has a target method, and that target is what is executed. All the code outside of any function definitions runs the same way any Python script behaves. That's my understanding. – Peter Aug 09 '17 at 13:29
  • Cannot reproduce your result. The code works fine on my machine and also at http://www.tutorialspoint.com/execute_python_online.php?PID=0Bw_CjBb95KQMVFJETFNzZG1rX1U – aristotll Aug 09 '17 at 13:32
  • @aristotll I cannot explain why it works online, but I have run this on two computers here and it fails every time. Nonetheless, thanks for checking. – Harry de winton Aug 09 '17 at 13:48
  • @Peter your answer seems reasonable. Do you know of somewhere that explains this a tad more thoroughly? Because I cannot find it in the literature. – Harry de winton Aug 09 '17 at 13:51
  • @aristotll: Just a guess but are you using some kind of Unix? The unix version will fork by default, rather than create a new process: https://docs.python.org/3.6/library/multiprocessing.html#contexts-and-start-methods I assume since the forking process has already run the script outside of the function definitions before forking, no output is going to be emitted more than once – PaulR Aug 09 '17 at 15:44
  • I am running code on a Windows machine to parallelize `pandas.DataFrame.apply()` and the same thing happened to me. – Jihjohn May 05 '21 at 08:27
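
As PaulR's comment above suggests, whether module-level code re-runs depends on the start method: Windows only supports spawn, while most Unixes default to fork. A minimal sketch (assuming Python 3.4+), forcing spawn so the Windows behaviour can be reproduced on Linux too:

import multiprocessing as mp

print('module-level code running')  # printed twice under spawn, once under fork

def worker():
    print('in worker')

if __name__ == '__main__':
    # spawn starts a fresh interpreter that re-imports this module, so the
    # print above runs again in the child; fork clones the already-running
    # process, so it does not
    ctx = mp.get_context('spawn')  # force spawn without changing the global default
    p = ctx.Process(target=worker)
    p.start()
    p.join()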

2 Answers

9

The way multiprocessing works in Python is that each child process imports the parent script. In Python, when you import a script, everything not defined within a function is executed. As I understand it, __name__ is changed on an import of the script (check this SO answer here for a better understanding), which is different from running the script directly on the command line, where __name__ == '__main__'. Because of this import, __name__ does not equal '__main__' in the child, which is why the code inside if __name__ == '__main__': is not executed for your subprocess.
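
You can see this directly by printing __name__ at module level. Under the spawn start method used on Windows, the re-imported copy of the script typically sees __name__ == '__mp_main__' rather than '__main__' (a minimal sketch):

from multiprocessing import Process

# runs in the parent and again in each spawned child: the parent prints
# '__main__', while a spawned child prints '__mp_main__'
print('module name:', __name__)

def f():
    print('child target running')

if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    p.join()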

Anything you don't want executed during subprocess calls should be moved into the if __name__ == '__main__': section of your code, as that section only runs for the parent process, i.e. the script you run initially.
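
Applied to the script in the question, a guarded version would look like this: the worker function stays at module level so the re-importing child can find it, and everything else moves under the guard:

from multiprocessing import Process

# worker function: stays at module level so the spawned child, which
# re-imports this file, can still find it by name
def f(p_num):
    print('Doing Process: {}'.format(p_num))

if __name__ == '__main__':
    data = 'The Data'
    print(data)
    print('Creating process')
    p = Process(target=f, args=(data,))
    print('Process made')
    p.start()
    print('process started')
    p.join()
    print('process joined')
    print('script finished')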

Hope this helps a bit. There are more resources around the web that explain this more thoroughly if you look around. I linked the official Python documentation for the multiprocessing module, and I recommend you look through it.

Peter
0

Exploring the topic, I ran into an issue with multiple loads of modules. To make it work per the above, I had to:

  • put all imports in a function (initializer())
  • return all the imports as objects from the call to initializer()
  • reference those objects in the definitions of, and calls to, the remaining functions in my module

The example module below runs multiple classification approaches on the same dataset in parallel:

print("I am being run so often because: https://stackoverflow.com/questions/45591987/multi-processing-code-repeatedly-runs")

def initializer():
    from sklearn import datasets

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target    

    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import Perceptron
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    import multiprocessing as mp
    from multiprocessing import Manager

    results = [] # for some reason it needs to be defined before the if __name__ == '__main__' block

    return x, y, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline, mp, Manager, results

def perceptron(x,y,results, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    scaler = StandardScaler()
    # note: n_iter was renamed to max_iter in newer scikit-learn releases
    estimator = ["Perceptron", Perceptron(n_iter=40, eta0=0.1, random_state=1)]

    pipe =  Pipeline([('Scaler', scaler),
                      ('Estimator', estimator[1])])

    pipe.fit(x,y)

    y_pred_pipe = pipe.predict(x)
    accuracy = accuracy_score(y, y_pred_pipe)
    result = [estimator[0], estimator[1], pipe, y_pred_pipe, accuracy]
    results.append(result)
    print(estimator[0], "Accuracy: ",accuracy)
    return results

def logistic(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    scaler = StandardScaler()
    estimator = ["LogisticRegression", LogisticRegression(C=100.0, random_state=1)]

    pipe =  Pipeline([('Scaler', scaler),
                      ('Estimator', estimator[1])])

    pipe.fit(x,y)

    y_pred_pipe = pipe.predict(x)
    accuracy = accuracy_score(y, y_pred_pipe)
    result = [estimator[0], estimator[1], pipe, y_pred_pipe, accuracy]
    results.append(result)
    print(estimator[0], "Accuracy: ",accuracy)
    return results

def parallel(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    # mp and Manager are not parameters here: they resolve as module-level
    # globals, because the if __name__ == '__main__' block assigns them
    # before parallel() is called
    with Manager() as manager:

        tasks = [perceptron, logistic,]
        results = manager.list() 
        procs = []
        for task in tasks:
            proc = mp.Process(name=task.__name__, target=task, args=(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline))
            procs.append(proc)
            print("done with check 1")
            proc.start()
            print("done with check 2")

        for proc in procs:
            print("done with check 3")
            proc.join()
            print("done with check 4")

        results = list(results)
        print("Within WITH")
        print(results)

    print("Within def")
    print(results)
    return results 

if __name__ == '__main__':
    # workaround for running multiprocessing under IPython/Spyder, where
    # __spec__ is otherwise not set
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

    x, y, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline, mp, Manager, results = initializer()

    results = parallel(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline)

    print("Outside of def")
    print(type(results))
    print(len(results))

    print(results[1]) # must stay inside the if block: in a spawned child this would otherwise run again without results being defined

    cpu_count = mp.cpu_count()
    print("CPUs: ", cpu_count)
sebtac