I'm having difficulty loading an XGBoost regression model with both pickle and joblib.
One complication may be that I am writing the pickle/joblib file on a Windows desktop but trying to load it on a MacBook Pro.
I attempted the solution previously posted in Python 3 - Can pickle handle byte objects larger than 4GB?; however, it still does not work. I get a variety of errors, usually something like:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
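For reference, the chunked read I adapted from that answer looks roughly like this (a sketch; 'file.sav' matches the dump in the training code below, and the chunk size comes from that post):

import os
import pickle

file_path = 'file.sav'
max_bytes = 2**31 - 1  # read in chunks below the 2 GB limit, per the linked answer

bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
xgb_model = pickle.loads(bytes_in)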
I have also tried using protocol=4 in both the pickle and joblib dumps, and in each case the file still could not be loaded.
The files I am trying to load have ranged anywhere from 2 GB to 11 GB, depending on whether I used joblib/pickle or the bytes_in/os.path solution previously posted.
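The plain load attempts on the Mac are roughly the following (a sketch; 'file.sav' is the name used in the training code below):

import joblib
import pickle

# joblib-written file
xgb_model = joblib.load('file.sav')

# or, for a pickle-written file
with open('file.sav', 'rb') as f_in:
    xgb_model = pickle.load(f_in)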
Does anyone know a good way to write large XGBoost regression models to disk, and/or how to then load them?
Here is the code used to train and write the XGBoost model:
import xgboost as xgb
import joblib

# build DMatrix objects for training and validation
dmatrix_train = xgb.DMatrix(
    X_train.values, y_train, feature_names=X_train.columns.values
)
dmatrix_validate = xgb.DMatrix(
    X_test.values, y_test, feature_names=X_train.columns.values
)
eval_set = [(dmatrix_train, "Train")]
eval_set.append((dmatrix_validate, "Validate"))

print("XGBoost #1")
params = {
    'silent': 1,
    'tree_method': 'auto',
    'max_depth': 10,
    'learning_rate': 0.001,
    'subsample': 0.1,
    'colsample_bytree': 0.3,
    # 'min_split_loss': 10,
    'min_child_weight': 10,
    # 'lambda': 10,
    # 'max_delta_step': 3
}
num_round = 500000

xgb_model = xgb.train(params=params, dtrain=dmatrix_train, evals=eval_set,
                      num_boost_round=num_round, verbose_eval=100)

joblib.dump(xgb_model, 'file.sav', protocol=4)
The final line has also been tried with standard pickle dumping as well, both with 'wb' and without it.
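For completeness, the plain-pickle variants of that last line look roughly like this (a sketch; the chunked write follows the bytes_in/os.path answer linked above):

import pickle

# straightforward dump with protocol=4
with open('file.sav', 'wb') as f_out:
    pickle.dump(xgb_model, f_out, protocol=4)

# chunked write, as in the linked answer
bytes_out = pickle.dumps(xgb_model, protocol=4)
max_bytes = 2**31 - 1
with open('file.sav', 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx + max_bytes])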