
I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best-performing set is found.

For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from skopt import BayesSearchCV

xgb_pipe = Pipeline([('clf', XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=1))])

xgb_fit_params = {'clf__early_stopping_rounds': 5,
                  'clf__eval_metric': 'mae',
                  'clf__eval_set': [[X_val.values, y_val.values]]}

# shuffle=True is required for random_state to have an effect in KFold
xgb_kfold = KFold(n_splits=5, shuffle=True, random_state=42)

xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_jobs=2,
                            n_points=1, n_iter=15, random_state=42, verbose=4,
                            scoring='neg_mean_absolute_error',
                            fit_params=xgb_fit_params)

xgb_unsm_cv.fit(X_train.values, y_train.values)

However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

This error persists whenever I use more than one thread in the BayesSearchCV call, regardless of how much memory I provide.

Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can both packages be forced to work together somehow? Without some way of multithreading the optimization, I fear that fitting my model will take weeks to perform. What can I do to fix this?

  • Do you really need to multithread both loops? – Dimitry Jul 16 '21 at 14:11
  • @Dimitry: I think I very much need to. Multithreading the XGBoost call means that the model trains in 4 hours instead of 23 - I have a lot of data - while I understand that at least 20 iterations are required to find an optimal parameter set in Bayesian Optimisation. How else should this be done? – Electronic Ant Jul 16 '21 at 19:36
  • Well, doesn't it utilize all the available cpus for those 4 hours? I'm not overly familiar with these particular libraries, but it sounds uncommon that you need to multithread the inner task and then also multithread the outer. – Dimitry Jul 17 '21 at 10:14
  • @Dimitry: XGBoost only uses all available CPUs if n_jobs = -1. What I was hoping to do was to get the Bayesian Optimisation to simultaneously evaluate several possible points in the parameter space, by setting n_jobs > 1, in order to find the optimal set faster. – Electronic Ant Jul 17 '21 at 17:06

1 Answer


I don't think the error has anything to do with an incompatibility between the libraries. Rather, by asking for two nested levels of parallelism you are running out of memory: each worker process spawned by BayesSearchCV gets its own copy of the complete dataset, so the memory footprint is multiplied by the number of workers. For scale, 7 million rows × 54 features of float64 is roughly 3 GB per copy, before XGBoost builds its own internal representation for training.

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

A segmentation fault strictly means an invalid memory access, but the SIGKILL(-9) exit code here is the signature of the operating system's out-of-memory (OOM) killer: the workers were terminated because they exhausted the available memory.

Note that XGBoost is already RAM-hungry on its own; coupling it with another multi-process operation is bound to take a toll (and, personally, is not something I would recommend on an everyday workstation).

The most viable solutions are probably to rent a high-memory machine from Google Cloud or some other cloud service (beware of the costs), or to shrink the in-memory size of the dataset using statistical techniques like the ones mentioned in this Kaggle notebook and this Data Science StackExchange post.
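If you go the downsizing route, one common trick from those references is to downcast each numeric column to the smallest dtype that can hold its values, which often halves the footprint. A minimal sketch, assuming X_train is a pandas DataFrame (the helper name reduce_mem_usage is my own illustration, not code from the linked notebook):

import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast each numeric column to the smallest dtype that fits its values.
    for col in df.select_dtypes(include=[np.number]).columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            # float64 -> float32 halves memory; check your precision needs first
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

X_train = reduce_mem_usage(X_train)  # e.g. ~3 GB of float64 becomes ~1.5 GB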

The idea is: either you scale up the hardware (monetary cost), run BayesSearchCV single-threaded while letting XGBoost use all the cores (time cost, as shown in the sketch below), or downsize the data using whatever technique best suits it.
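For the single-threaded route, a minimal sketch reusing the names from your question: keep n_jobs=1 in BayesSearchCV so only one copy of the data lives in memory, and hand all the cores to XGBoost instead.

# Outer loop sequential, inner training parallel: one copy of the data in RAM
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state=42,
                                          objective='reg:squarederror',
                                          n_jobs=-1))])  # all cores go to XGBoost

xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold,
                            n_jobs=1,  # no worker processes, no dataset copies
                            n_points=1, n_iter=15, random_state=42, verbose=4,
                            scoring='neg_mean_absolute_error',
                            fit_params=xgb_fit_params)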

Finally, the answer still stands: the libraries are most likely compatible; the data is simply too large for the available RAM once it is duplicated across workers.

inarticulatus