I am trying to train a series of Scikit-Learn KMeans clustering models in separate processes using Python's multiprocessing library. When I try to use multiprocessing.Pool to train the models, the code runs without raising any runtime errors, but the execution never completes.

Further investigation reveals that the code only fails to terminate when the size in memory of the training data (X in the code snippet below) exceeds 2^16 = 65536 bytes. Below that threshold, the code behaves as expected.
import sys
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

# The Code Below Executes and Completes with MULTIPLIER = 227 but not when MULTIPLIER = 228
MULTIPLIER = 227

# Some Random Training Data
X = np.array(
    [[ 0.19276125, -0.05182922, -0.06014779, 0.06234482, -0.00727767, -0.05975948],
     [ 0.3541313, -0.29502648, 0.3088767, 0.02438405, -0.01978588, -0.00060496],
     [ 0.22324295, -0.04291656, -0.0991894, 0.04455933, -0.00290042, 0.0316047 ],
     [ 0.30497936, -0.03115212, -0.26681659, -0.00742825, 0.00978793, 0.00555566],
     [ 0.1584528, -0.01984878, -0.03908984, -0.03246589, -0.01520335, -0.02516451],
     [ 0.16888249, -0.04196552, -0.02432088, -0.02362059, 0.0353778, 0.02663082]]
    * MULTIPLIER)

# Prints 65488 when MULTIPLIER = 227 and 65776 when MULTIPLIER = 228
print("Input Data Size: ", sys.getsizeof(X))

# Training without Multiprocessing Always Works Regardless of the Size of X
no_multiprocessing = KMeans(n_clusters=2, n_jobs=1).fit(X)
print("Training without multiprocessing complete!")  # Always prints

# Training with Multiprocessing Fails when X is too Large
def run_kmeans(X):
    return KMeans(n_clusters=2, n_jobs=1).fit(X)

with Pool(processes=1) as p:
    yes_multiprocessing = p.map(run_kmeans, [X])
    print("Training with multiprocessing complete!")  # Doesn't print when MULTIPLIER = 228
I am always very careful to keep the n_jobs parameter set to 1 or None so that I don't have my process spawning its own processes.
Curiously, this memory limit does not seem to be built into multiprocessing.Pool as a "per element" memory limit, since I can pass in a very long string (consuming more than 65536 bytes) and the code terminates without complaint.
import sys
from multiprocessing import Pool

my_string = "This sure is a silly string" * 2500
print("String size:", sys.getsizeof(my_string))  # Prints 79554

def add_exclamation(x):
    return x + "!"

with Pool(processes=1) as p:
    my_string = p.map(add_exclamation, [my_string])
    print("Multiprocessing Completed!")  # Prints just fine
Terminating the execution of the first code snippet when it hangs always produces the following traceback:
File "/path/to/my/code", line 29, in <module>
yes_multiprocessing = p.map(run_kmeans, [X])
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 638, in get
self.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 635, in wait
self._event.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 551, in wait
signaled = self._cond.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 295, in wait
waiter.acquire()
KeyboardInterrupt
I have tried forcing my macOS system to spawn processes instead of forking them, as recommended here. I've investigated suggestions like making sure all relevant code lives inside a with block, and avoiding an iPython environment (executing the Python code straight from the terminal), to no avail. Changing the number of Pool processes also has no impact. I have also tried switching from multiprocessing.Pool to multiprocessing.Process to keep daemonic Pool processes from trying to spawn processes from within the KMeans joblib integration, as discussed here, again without success. A rough sketch of those last two attempts is shown below.
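For reference, here is roughly what the spawn/Process attempt looked like. This is only a sketch: the helper names are my own, and the random X is just a stand-in for training data of the same size as the failing example.

import multiprocessing as mp
import numpy as np
from sklearn.cluster import KMeans

def run_kmeans(X, queue):
    # Fit in the child process and send the fitted model back to the parent
    queue.put(KMeans(n_clusters=2, n_jobs=1).fit(X))

if __name__ == "__main__":
    # Force the "spawn" start method instead of the macOS default ("fork" on Python 3.6)
    mp.set_start_method("spawn")

    # Stand-in data with the same shape as the failing example (228 * 6 rows, 6 columns)
    X = np.random.rand(228 * 6, 6)

    # Use a bare Process instead of a (daemonic) Pool worker
    queue = mp.Queue()
    proc = mp.Process(target=run_kmeans, args=(X, queue))
    proc.start()
    result = queue.get()  # Still hangs here when X is large
    proc.join()
    print("Training with multiprocessing complete!")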
How can I train multiple KMeans models in separate processes when the training data exceeds 65536 bytes?