
I am running code to do binary classification and then predict labels. The code runs perfectly with one database of 257,673 rows and 47 columns, but when I try a smaller one of 91,690 rows and 10 columns, I get the error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)} 

I am using cross-validation with n_jobs=-1:

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cross_val_score(model, X.drop(target, axis=1), X[target], cv=outer_cv, n_jobs=-1, scoring='neg_mean_squared_error')

Here model can be any of the scikit-learn algorithms. I have tried AdaBoostClassifier, LogisticRegression, KNN, SVM, GradientBoosting, RandomForest, DecisionTreeClassifier and many others, and I keep getting the same error.

I have tried changing to n_jobs=-2, 1 and 2, and the error persists. I am running the code in a Jupyter notebook, and my laptop has the following specs:

Ubuntu 18.04.4 LTS
RAM: 15.5 GB
Processor: Intel® Core™ i7-8550U CPU @ 1.80GHz × 8

How could I solve this issue?
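For anyone trying to reproduce this, here is a minimal self-contained version of the setup (with a synthetic stand-in dataset, since the real data isn't available). Running with n_jobs=1 is also a useful first debugging step, because a failure then raises a normal Python traceback instead of silently killing a joblib worker:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real database, which isn't shown in the question
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# n_jobs=1 runs the folds in the parent process, so a crash surfaces as a
# normal traceback instead of a TerminatedWorkerError
scores = cross_val_score(model, X, y, cv=outer_cv, n_jobs=1,
                         scoring='neg_mean_squared_error')
print(scores)  # one (negative) score per fold
```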

  • Maybe you can find what you are looking for here https://github.com/scikit-learn-contrib/skope-rules/issues/18 or here https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l – Meto May 23 '20 at 19:21
  • Thanks @Meto but I checked all that before posting my question... – Ernesto Lopez Fune May 23 '20 at 19:28
  • Did you check if the smaller DB has clean data, ie. consistent types in columns, no NaNs, no missing field or line separators etc.? – mac13k May 26 '20 at 12:36
  • @mac13k the whole database is already clean and pre-processed, ready to feed the Machine Learning toolbox... that's why I am wondering what else could be happening. – Ernesto Lopez Fune May 26 '20 at 14:36
  • Which version of sklearn are you using? – mac13k May 26 '20 at 15:17
  • @mac13k I am using version 0.22.1. – Ernesto Lopez Fune May 26 '20 at 15:21
  • Have you tried a newer version? https://scikit-learn.org/stable/whats_new.html – mac13k May 26 '20 at 15:24
  • @mac13k I updated, restarted, and it still throws the same error – Ernesto Lopez Fune May 26 '20 at 15:46
  • OK, tough nut. But judging by the error message this is not a mem leak, but seg fault or OOM. You can try to debug it using strace or gdb. – mac13k May 26 '20 at 15:58
  • 1
    Also you can try to load smaller portions of the DB to your model - if the problem occurs only for some portions of the DB but not all, you will be able to narrow down to the problematic rows. – mac13k May 26 '20 at 18:07
  • Can you print the schema of the data input ? Also, if that's possible, can you post the dataset (if it's not confidential)? – Guillaume Jun 01 '20 at 19:08
  • You said `The code runs perfectly with one specific database of size 257673 rows and 47 columns` Do you mean it runs fine without CV and the smaller dataset fails with OOM when doing CV ? – mujjiga Jun 01 '20 at 21:02

1 Answer


I found the answer to this question. It turns out that some of the scikit-learn algorithms produce this error depending on how the categorical features are encoded. In my case, I had to remove CategoricalNB(), Ridge(), ElasticNet() and GaussianProcessClassifier() from the list of algorithms I was testing, because each of them crashed whether I preprocessed with StandardScaler() or with MinMaxScaler().
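One way to isolate which estimators crash, sketched below on synthetic stand-in data (the model list and dataset here are illustrative, not the exact ones from the question), is to loop over the candidates with n_jobs=1 and catch exceptions per model, so a single failing estimator doesn't take down the whole run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: the real dataset isn't available)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=1),
    "CategoricalNB+MinMax": make_pipeline(MinMaxScaler(), CategoricalNB()),
}

working, broken = [], []
for name, model in candidates.items():
    try:
        # n_jobs=1 keeps the work in-process, so failures raise a catchable
        # exception instead of killing a joblib worker
        cross_val_score(model, X, y, cv=outer_cv, n_jobs=1,
                        scoring='neg_mean_squared_error')
        working.append(name)
    except Exception as exc:
        broken.append((name, type(exc).__name__))

print("working:", working)
print("broken:", broken)
```

Any estimator that lands in the "broken" list can then be dropped from the benchmark, which is effectively what the answer above describes.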