
I'm trying to train models in batches to reduce memory usage.

Here's my example of incremental training for gradient boosting in xgboost (the code is under Details below).

It uses xgb_model to train in batches, but this training produces models that perform only about as well as models trained on a single batch.

How can I reduce the errors induced by incremental training?

Details

My incremental training:

import numpy as np
import xgboost as xgb


def xgb_native_batch(batch_size=100):
    """Train in batches that update the same model."""

    batches = int(np.ceil(len(y_train) / batch_size))

    if XGB_MODEL_FILE:
        # Bootstrap an empty model (0 boosting rounds) and save it to disk so
        # the first batch has a model file to continue from.
        dtrain = xgb.DMatrix(data=X_train, label=y_train)
        bst = xgb.train(
            params=xgb_train_params,
            dtrain=dtrain,
            num_boost_round=0
        )  # type: xgb.Booster
        bst.save_model(XGB_MODEL_FILE)
    else:
        # OR just pass the Booster object around, starting from None.
        bst = None

    for i in range(batches):

        start = i * batch_size
        end = start + batch_size
        dtrain = xgb.DMatrix(X_train[start:end, :], y_train[start:end])

        # Continue training from the previous state (model file or Booster).
        bst = xgb.train(
            dtrain=dtrain,
            params=xgb_train_params,
            xgb_model=XGB_MODEL_FILE or bst
        )  # type: xgb.Booster

        if XGB_MODEL_FILE:
            bst.save_model(XGB_MODEL_FILE)

    dtest = xgb.DMatrix(data=X_test, label=y_test)
    pr_y_test_hat = bst.predict(dtest)

    return pr_y_test_hat
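
For comparison, here is a minimal sketch of what the bulk reference model (called xgb_native_bulk in the tests below) could look like. This is not the original benchmark code; the data variables, xgb_train_params and the default number of boosting rounds are assumptions mirroring the batch function above.

import xgboost as xgb


def xgb_native_bulk():
    """Reference model: train once on all the data (sketch)."""
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    dtest = xgb.DMatrix(data=X_test, label=y_test)

    bst = xgb.train(
        params=xgb_train_params,
        dtrain=dtrain,
        # default num_boost_round, matching the per-batch calls above
    )
    return bst.predict(dtest)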

Tests

The tests are based on four datasets. I created these models:

  • xgb_native_bulk is the reference model trained on all data at once.
  • xgb_native_bulk_<N> is the model trained on a subsample of size N.
  • xgb_native_batch_<N> is the model trained continuously on all the data divided into small batches of size N (continuous learning through model updates, using the function above).

Metrics:

make_classification: binary, N=3750
========================================
                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk                0.8624  0.933398
xgb_native_bulk_100            0.6192  0.669542
xgb_native_batch_100           0.6368  0.689123
xgb_native_bulk_500            0.7440  0.837590
xgb_native_batch_500           0.7528  0.829661
xgb_native_bulk_1000           0.7944  0.880586
xgb_native_batch_1000          0.8048  0.886607

load_breast_cancer: binary, N=426
========================================
                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk              0.958042  0.994902
xgb_native_bulk_100          0.930070  0.986037
xgb_native_batch_100         0.965035  0.989805
xgb_native_bulk_500          0.958042  0.994902
xgb_native_batch_500         0.958042  0.994902
xgb_native_bulk_1000         0.958042  0.994902
xgb_native_batch_1000        0.958042  0.994902

make_regression: reg, N=3750
========================================
                                mse
algorithm                          
xgb_native_bulk        5.513056e+04
xgb_native_bulk_100    1.209782e+05
xgb_native_batch_100   7.872892e+07
xgb_native_bulk_500    8.694831e+04
xgb_native_batch_500   1.150160e+05
xgb_native_bulk_1000   6.953936e+04
xgb_native_batch_1000  5.060867e+04

load_boston: reg, N=379
========================================
                             mse
algorithm                       
xgb_native_bulk        15.910990
xgb_native_bulk_100    25.160251
xgb_native_batch_100   16.931899
xgb_native_bulk_500    15.910990
xgb_native_batch_500   15.910990
xgb_native_bulk_1000   15.910990
xgb_native_batch_1000  15.910990

The problem is that incremental learning does not do well on the longer and wider datasets. For example, the classification problem:

                       accuracy_score    aurocc
algorithm                                      
xgb_native_bulk                0.8624  0.933398
xgb_native_bulk_100            0.6192  0.669542
xgb_native_batch_100           0.6368  0.689123

There is hardly any difference between the model trained on 100 rows at once and the model trained on 3750 rows in batches of 100, and both are far from the reference model trained on 3750 rows at once.

  • why don't you try updating the parameter? refreshing the leafs – let us know if you could answer your own question and share your learnings! – Areza May 06 '20 at 11:03
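
The comment above points at XGBoost's tree-refreshing updater. As a hedged sketch of that idea (the parameter names process_type, updater and refresh_leaf come from the XGBoost parameter documentation; the data and settings are placeholders, and whether this actually helps here is untested):

import numpy as np
import xgboost as xgb

# Toy data standing in for an "old" batch and a "new" batch.
X_old, y_old = np.random.rand(500, 20), np.random.randint(0, 2, 500)
X_new, y_new = np.random.rand(500, 20), np.random.randint(0, 2, 500)
params = {"objective": "binary:logistic", "max_depth": 3}

# Initial model on the old batch.
bst = xgb.train(params, xgb.DMatrix(X_old, label=y_old), num_boost_round=10)

# Re-fit the leaf values of the existing trees on the new batch instead of
# growing new trees.
refresh_params = dict(
    params,
    process_type="update",  # update existing trees rather than adding new ones
    updater="refresh",      # recompute node statistics for the existing trees
    refresh_leaf=True,      # also update the leaf values, not only the stats
)
bst = xgb.train(
    refresh_params,
    xgb.DMatrix(X_new, label=y_new),
    num_boost_round=10,     # should not exceed the existing number of rounds
    xgb_model=bst,
)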

1 Answer


XGBoost requires the entire dataset in continuous learning

"Continuous training" in XGBoost refers to continuing, for example, boosting rounds, as shown in their unit tests:

Those tests use the entire dataset even when xgb_model is specified. As a result, the error rate of the "full" model equals the error rate of the model trained incrementally.
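
As an illustration (a minimal sketch, not the actual unit-test code), this is the pattern being described: every xgb.train call sees the full DMatrix, and xgb_model only continues the boosting rounds. The data and parameters are placeholders.

import numpy as np
import xgboost as xgb

# Toy "full" dataset.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3}

# Rounds 1-10, then rounds 11-20 continued from the same Booster,
# both times on the full data.
bst = xgb.train(params, dtrain, num_boost_round=10)
bst = xgb.train(params, dtrain, num_boost_round=10, xgb_model=bst)

# Which makes the continued model comparable to 20 rounds in a single call:
bst_single = xgb.train(params, dtrain, num_boost_round=20)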

When the model is updated based on subsets of the data, it will be as bad as if it had no previous training rounds.

Memory-saving incremental training is discussed under the name "external memory". More broadly, the XGBoost FAQ covers how to work with big datasets.
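
For reference, a minimal sketch of the external-memory route. The file names are placeholders and the exact cache-file syntax depends on the XGBoost version, so check the external-memory documentation for the version you use.

import xgboost as xgb

# Appending '#<cache prefix>' to a libsvm file path asks XGBoost to stream the
# data from disk (out-of-core training) instead of loading it all into memory.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
dtest = xgb.DMatrix("test.libsvm#dtest.cache")

params = {"objective": "binary:logistic", "max_depth": 3}
bst = xgb.train(params, dtrain, num_boost_round=100)
pred = bst.predict(dtest)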

  • does your incremental training solve the cold-start problem completely? I am having an issue with my XGBoost, getting a cold-start score at every fold, which is how I came across your thread. – ibozkurt79 Mar 28 '19 at 06:23