I'm trying to train models in batches to reduce memory usage. Here's my example of incremental training for gradient boosting in xgboost: it uses the xgb_model argument of xgb.train to continue training from the previous batch. But this training creates models that perform only about as well as a model trained on a single batch.
How can I reduce the errors induced by incremental training?
Details
My incremental training:
import numpy as np
import xgboost as xgb


def xgb_native_batch(batch_size=100):
    """Train in batches that update the same model."""
    batches = int(np.ceil(len(y_train) / batch_size))

    if XGB_MODEL_FILE:
        # Bootstrap a zero-round (empty) model and save it to disk
        dtrain = xgb.DMatrix(data=X_train, label=y_train)
        bst = xgb.train(
            params=xgb_train_params,
            dtrain=dtrain,
            num_boost_round=0,
        )  # type: xgb.Booster
        bst.save_model(XGB_MODEL_FILE)
    else:
        # OR just start from an empty Booster
        bst = None

    for i in range(batches):
        start = i * batch_size
        end = start + batch_size
        dtrain = xgb.DMatrix(X_train[start:end, :], y_train[start:end])
        # xgb_model makes xgb.train continue from the previous model,
        # appending new trees fit on the current batch only
        bst = xgb.train(
            dtrain=dtrain,
            params=xgb_train_params,
            xgb_model=XGB_MODEL_FILE or bst,
        )  # type: xgb.Booster
        if XGB_MODEL_FILE:
            bst.save_model(XGB_MODEL_FILE)

    dtest = xgb.DMatrix(data=X_test, label=y_test)
    pr_y_test_hat = bst.predict(dtest)
    return pr_y_test_hat
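For completeness, the function relies on a few globals (X_train, y_train, X_test, y_test, xgb_train_params, XGB_MODEL_FILE). A minimal setup sketch for one of the datasets; the sizes and parameter values here are placeholders, not the exact ones I used:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 5000 samples with a 25% test split gives the N=3750 train set below
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

XGB_MODEL_FILE = None  # or a file path to continue training from disk
xgb_train_params = {   # placeholder values
    "objective": "binary:logistic",
    "max_depth": 3,
    "eta": 0.3,
}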
Tests
The tests are based on four datasets. I created these models (the bulk baselines are sketched after this list):
- xgb_native_bulk is the reference model trained on all the data at once.
- xgb_native_bulk_<N> is the model trained on a single subsample of size N.
- xgb_native_batch_<N> is the model trained on all the data divided into small batches of size N (continuous learning through model updates).
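Roughly, the two bulk baselines look like this (a sketch; xgb_native_bulk_n is a hypothetical helper standing in for the per-N variants):

def xgb_native_bulk():
    """Reference: one model trained on all the training data at once."""
    dtrain = xgb.DMatrix(X_train, label=y_train)
    bst = xgb.train(params=xgb_train_params, dtrain=dtrain)
    return bst.predict(xgb.DMatrix(X_test))


def xgb_native_bulk_n(n=100):
    """Baseline: one model trained once on a subsample of size n, never updated."""
    dtrain = xgb.DMatrix(X_train[:n, :], label=y_train[:n])
    bst = xgb.train(params=xgb_train_params, dtrain=dtrain)
    return bst.predict(xgb.DMatrix(X_test))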
Metrics:
make_classification: binary, N=3750
========================================
accuracy_score aurocc
algorithm
xgb_native_bulk 0.8624 0.933398
xgb_native_bulk_100 0.6192 0.669542
xgb_native_batch_100 0.6368 0.689123
xgb_native_bulk_500 0.7440 0.837590
xgb_native_batch_500 0.7528 0.829661
xgb_native_bulk_1000 0.7944 0.880586
xgb_native_batch_1000 0.8048 0.886607
load_breast_cancer: binary, N=426
========================================
accuracy_score aurocc
algorithm
xgb_native_bulk 0.958042 0.994902
xgb_native_bulk_100 0.930070 0.986037
xgb_native_batch_100 0.965035 0.989805
xgb_native_bulk_500 0.958042 0.994902
xgb_native_batch_500 0.958042 0.994902
xgb_native_bulk_1000 0.958042 0.994902
xgb_native_batch_1000 0.958042 0.994902
make_regression: reg, N=3750
========================================
mse
algorithm
xgb_native_bulk 5.513056e+04
xgb_native_bulk_100 1.209782e+05
xgb_native_batch_100 7.872892e+07
xgb_native_bulk_500 8.694831e+04
xgb_native_batch_500 1.150160e+05
xgb_native_bulk_1000 6.953936e+04
xgb_native_batch_1000 5.060867e+04
load_boston: reg, N=379
========================================
mse
algorithm
xgb_native_bulk 15.910990
xgb_native_bulk_100 25.160251
xgb_native_batch_100 16.931899
xgb_native_bulk_500 15.910990
xgb_native_batch_500 15.910990
xgb_native_bulk_1000 15.910990
xgb_native_batch_1000 15.910990
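The metric columns come from the usual sklearn scorers; assuming the functions above return predicted probabilities (or predicted values, for regression), the evaluation looks roughly like this:

from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

pr_y_test_hat = xgb_native_batch(batch_size=100)

# Classification datasets: threshold the predicted probabilities
print(accuracy_score(y_test, pr_y_test_hat > 0.5))  # accuracy_score column
print(roc_auc_score(y_test, pr_y_test_hat))         # aurocc column

# Regression datasets:
# print(mean_squared_error(y_test, pr_y_test_hat))  # mse column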
The problem is that incremental learning does poorly on the larger, higher-dimensional datasets. For example, on the classification problem:
accuracy_score aurocc
algorithm
xgb_native_bulk 0.8624 0.933398
xgb_native_bulk_100 0.6192 0.669542
xgb_native_batch_100 0.6368 0.689123
There is practically no difference between the model trained on 100 rows at once and the model trained on all 3750 rows in batches of 100, and both are far from the reference model trained on 3750 rows at once.
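My understanding of why this happens: each xgb.train call with xgb_model keeps the existing trees frozen and appends num_boost_round new trees fit only on the current batch, so no individual tree ever sees more than batch_size rows. A self-contained illustration with made-up data:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
X2, y2 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)

params = {"objective": "binary:logistic", "max_depth": 3}

bst = xgb.train(params=params, dtrain=xgb.DMatrix(X1, label=y1),
                num_boost_round=10)
print(len(bst.get_dump()))  # 10 trees, all fit on batch 1

bst = xgb.train(params=params, dtrain=xgb.DMatrix(X2, label=y2),
                num_boost_round=10, xgb_model=bst)
print(len(bst.get_dump()))  # 20 trees; the 10 new ones saw only batch 2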
References
- How can I implement incremental training for xgboost?
- Examples of incremental learning from the xgboost repo: https://github.com/dmlc/xgboost/blob/master/tests/python/test_training_continuation.py