2

I am trying to train a series of KMeans clustering models from scikit-learn in separate processes using Python's multiprocessing library. When I try to use multiprocessing.Pool to train the models, the code runs without raising any runtime errors, but the execution never completes.

Further investigation reveals that the code only fails to terminate when the size in memory of the training data (X in the code snippet below) exceeds 2^16 = 65536 bytes. Less than that, and the code behaves as expected.

import sys
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

# The Code Below Executes and Completes with MULTIPLIER = 227 but not when MULTIPLIER = 228
MULTIPLIER = 227

# Some Random Training Data
X = np.array(
    [[ 0.19276125, -0.05182922, -0.06014779, 0.06234482, -0.00727767, -0.05975948],
     [ 0.3541313,  -0.29502648,  0.3088767, 0.02438405, -0.01978588, -0.00060496],
     [ 0.22324295, -0.04291656, -0.0991894, 0.04455933, -0.00290042, 0.0316047 ],
     [ 0.30497936, -0.03115212, -0.26681659, -0.00742825,  0.00978793, 0.00555566],
     [ 0.1584528,  -0.01984878, -0.03908984, -0.03246589, -0.01520335, -0.02516451],
     [ 0.16888249, -0.04196552, -0.02432088, -0.02362059,  0.0353778, 0.02663082]]
    * MULTIPLIER)

# Prints 65488 when MULTIPLIER = 227 and 65776 when MULTIPLIER = 228
print("Input Data Size: ", sys.getsizeof(X)) 

# Training without Multiprocessing Always Works Regardless of the Size of X
no_multiprocessing = KMeans(n_clusters=2, n_jobs=1).fit(X)
print("Training without multiprocessing complete!") # Always prints

# Training with Multiprocessing Fails when X is too Large
def run_kmeans(X):
    return KMeans(n_clusters=2, n_jobs=1).fit(X)

with Pool(processes=1) as p:
    yes_multiprocessing = p.map(run_kmeans, [X])
print("Training with multiprocessing complete!") # Doesn't print when MULTIPLIER = 228

I am always careful to keep the n_jobs parameter set to 1 or None so that my worker process doesn't spawn processes of its own.

Curiously, this memory limit does not seem to be built into multiprocessing.Pool as a "per element" memory limit, since I can pass in a very long string (consuming more than 65536 bytes) and the code terminates without complaint.

import sys
from multiprocessing import Pool

my_string = "This sure is a silly string" * 2500

print("String size:", sys.getsizeof(y)) # Prints 79554

def add_exclamation(x):
    return x + "!"

with Pool(processes=1) as p:
    result = p.map(add_exclamation, [my_string])

print("Multiprocessing Completed!") # Prints Just Fine

Terminating the execution of the first code snippet when it hangs always produces the following traceback:

  File "/path/to/my/code", line 29, in <module>
    yes_multiprocessing  = p.map(run_kmeans, [X])

  File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 638, in get
    self.wait(timeout)

  File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 635, in wait
    self._event.wait(timeout)

  File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)

  File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()

KeyboardInterrupt

I have tried forcing my macOS system to spawn processes instead of forking them, as recommended here. I've investigated suggestions like making sure all relevant code sits inside a with block, and running the code straight from the terminal rather than from an IPython environment, to no avail. Changing the number of Pool processes also has no impact. I have also tried switching from multiprocessing.Pool to multiprocessing.Process, to keep daemonic Pool processes from trying to spawn processes of their own inside the KMeans joblib integration, as discussed here, without success.
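For reference, forcing the spawn start method looks roughly like this. It is only a minimal sketch: the random X is a stand-in for the training data above, and it keeps n_jobs=1 to mirror the snippet in the question (newer scikit-learn releases have removed that parameter from KMeans).

import multiprocessing as mp

import numpy as np
from sklearn.cluster import KMeans

def run_kmeans(X):
    return KMeans(n_clusters=2, n_jobs=1).fit(X)

if __name__ == "__main__":
    # "spawn" starts each worker with a fresh interpreter instead of a
    # copy of the parent process (the default on macOS under Python 3.6 is "fork").
    mp.set_start_method("spawn")

    X = np.random.rand(1500, 6)  # stand-in for the training data above

    with mp.Pool(processes=1) as p:
        results = p.map(run_kmeans, [X])

    print("Training with multiprocessing complete!")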

How can I train multiple KMeans models on separate processes with training data which exceeds 65536 bytes?

NolantheNerd

2 Answers


After much more trial and error, the issue seems to have been an environment problem: running the above code in a fresh environment worked. I'm still not sure which of my packages caused the hang.

NolantheNerd

I had a similar problem: without multiprocessing everything worked as expected, but with multiprocessing the execution never completed (I didn't check whether the size of the data matters).

TL;DR: Placing all the relevant imports inside the function (run_kmeans in your case) solved the problem, but only if they appear inside the function and nowhere else.

My assumption: when the imports live in the "global process" (outside the function), the fork happens after the imports. If an imported library (say, sklearn) relies on the process id for some reason, it can break after the fork: the process id changes, but the library still holds the pre-fork value. Doing the import inside the function means it runs after the fork, once the child's process id (or whatever other post-fork state matters) has been determined. The import has to be only inside the function, and not at module level as well, because Python does not import a library again if it has already been imported.
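To make that concrete, below is a minimal sketch of the suggested change applied to the snippet from the question. The random X is a stand-in for the real training data, and the only substantive difference is that sklearn is imported inside the worker function and nowhere else.

import numpy as np
from multiprocessing import Pool

def run_kmeans(X):
    # Import sklearn here, so the import runs inside the worker process
    # (after the fork) rather than in the parent before forking.
    from sklearn.cluster import KMeans
    return KMeans(n_clusters=2, n_jobs=1).fit(X)

if __name__ == "__main__":
    X = np.random.rand(1500, 6)  # stand-in training data

    with Pool(processes=1) as p:
        results = p.map(run_kmeans, [X])

    print("Training with multiprocessing complete!")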

Mariel