
I am running code to do binary classification and then predict labels. The code runs perfectly with one database of 257,673 rows and 47 columns, but when I try a smaller one of 91,690 rows and 10 columns, I get the error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)} 

I am using cross-validation with n_jobs=-1:

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cross_val_score(model, X.drop(target, axis=1), X[target], cv=outer_cv, n_jobs=-1, scoring='neg_mean_squared_error')

Here model can be any of the scikit-learn algorithms. I have tried AdaBoostClassifier, LogisticRegression, KNN, SVM, GradientBoosting, RandomForest, DecisionTreeClassifier and many others, and I keep getting the same error.

I have tried changing to n_jobs=-2, 1 and 2, and the error persists. I am running the code in a Jupyter notebook, and my laptop has the following specs:

Ubuntu 18.04.4 LTS
RAM: 15.5 GB
Processor: Intel® Core™ i7-8550U CPU @ 1.80GHz × 8

How could I solve this issue?
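For anyone trying to reproduce this, here is a minimal self-contained version of the setup (with a synthetic stand-in dataset, since the real data isn't available). Running with n_jobs=1 is also a useful first debugging step, because a failure then raises a normal Python traceback instead of silently killing a joblib worker:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real database, which isn't shown in the question
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# n_jobs=1 runs the folds in the parent process, so a crash surfaces as a
# normal traceback instead of a TerminatedWorkerError
scores = cross_val_score(model, X, y, cv=outer_cv, n_jobs=1,
                         scoring='neg_mean_squared_error')
print(scores)  # one (negative) score per fold
```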

  • Maybe you can find what you are looking for here https://github.com/scikit-learn-contrib/skope-rules/issues/18 or here https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l – Meto May 23 '20 at 19:21
  • Thanks @Meto but I checked all that before posting my question... – Ernesto Lopez Fune May 23 '20 at 19:28
  • Did you check if the smaller DB has clean data, ie. consistent types in columns, no NaNs, no missing field or line separators etc.? – mac13k May 26 '20 at 12:36
  • @mac13k the whole database is already clean and pre-processed, ready to feed the Machine Learning toolbox... that's why I am wondering what else could be happening. – Ernesto Lopez Fune May 26 '20 at 14:36
  • Which version of sklearn are you using? – mac13k May 26 '20 at 15:17
  • @mac13k I am using version 0.22.1. – Ernesto Lopez Fune May 26 '20 at 15:21
  • Have you tried a newer version? https://scikit-learn.org/stable/whats_new.html – mac13k May 26 '20 at 15:24
  • @mac13k I updated, restarted, and it still throws the same error – Ernesto Lopez Fune May 26 '20 at 15:46
  • OK, tough nut. But judging by the error message this is not a mem leak, but seg fault or OOM. You can try to debug it using strace or gdb. – mac13k May 26 '20 at 15:58
  • 1
    Also you can try to load smaller portions of the DB to your model - if the problem occurs only for some portions of the DB but not all, you will be able to narrow down to the problematic rows. – mac13k May 26 '20 at 18:07
  • Can you print the schema of the data input ? Also, if that's possible, can you post the dataset (if it's not confidential)? – Guillaume Jun 01 '20 at 19:08
  • You said `The code runs perfectly with one specific database of size 257673 rows and 47 columns` Do you mean it runs fine without CV and the smaller dataset fails with OOM when doing CV ? – mujjiga Jun 01 '20 at 21:02

1 Answer


I found the answer to this question. It turns out that some of the scikit-learn algorithms produce this error depending on how the categorical features are encoded. In my case, I had to remove CategoricalNB(), Ridge(), ElasticNet() and GaussianProcessClassifier() from the list of algorithms I was testing, because each of them crashed whether I preprocessed with StandardScaler() or with MinMaxScaler().
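One way to isolate which estimators crash, sketched below on synthetic stand-in data (the model list and dataset here are illustrative, not the exact ones from the question), is to loop over the candidates with n_jobs=1 and catch exceptions per model, so a single failing estimator doesn't take down the whole run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: the real dataset isn't available)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=1),
    "CategoricalNB+MinMax": make_pipeline(MinMaxScaler(), CategoricalNB()),
}

working, broken = [], []
for name, model in candidates.items():
    try:
        # n_jobs=1 keeps the work in-process, so failures raise a catchable
        # exception instead of killing a joblib worker
        cross_val_score(model, X, y, cv=outer_cv, n_jobs=1,
                        scoring='neg_mean_squared_error')
        working.append(name)
    except Exception as exc:
        broken.append((name, type(exc).__name__))

print("working:", working)
print("broken:", broken)
```

Any estimator that lands in the "broken" list can then be dropped from the benchmark, which is effectively what the answer above describes.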