59

The problem is that my training data cannot fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, calculates the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because in that case it just rebuilds the whole model for each batch.

Alain
Marat Zakirov

10 Answers

55

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.

Here's a small experiment that I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model on the first half and get a score that will serve as a benchmark. Then fit two models on the second half; one of them gets the additional parameter xgb_model. If passing in the extra parameter made no difference, we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the one trained without it (model_2_v1).

import xgboost as xgb
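# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split as ttsplit
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
X = boston['data']
y = boston['target']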

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

# 'reg:linear' is the deprecated name of 'reg:squarederror'; 'verbosity' replaces 'verbose'
params = {'objective': 'reg:squarederror', 'verbosity': 0}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

desertnaut
Alain
  • 3
    I would understand if model_2_v2 performed worse than a model that used both datasets at once. But model_2_v2 is worse than model_1, which is pretty strange, because we gave it a new dataset that model_1 didn't see, and in the end model_2_v2 performed worse... It seems that boosted trees are not the best way to perform incremental learning. @pikachau did you try using model_1 instead of 'experiment.model'? – Marat Zakirov Jul 08 '16 at 07:57
  • It might be because the dataset is pretty small (sample size = 150). With a larger dataset, I think model_2_v2 should outperform model_1. Oh, experiment.model == model_1; I should have made that more explicit! – Alain Jul 08 '16 at 08:09
  • 1
    Should the result of model_2_v_2 be the same as a model trained on the entire train set (train_1 and train_2)? – Itamar Mushkin Jun 20 '21 at 13:44
  • 4
    The lead maintainer of XGBoost is cited here saying that this is not the correct usage and will not result in the expected behavior. Iterative training does not seem to be possible with XGBoost: https://github.com/dmlc/xgboost/issues/3055#issuecomment-359648107. Also see https://datascience.stackexchange.com/questions/47510/how-to-reach-continue-training-in-xgboost and https://stackoverflow.com/questions/48366379/reduce-error-rates-from-incremental-training-in-xgboost-in-python – Phillip Maire Feb 09 '22 at 07:49
24

There is now (version 0.6?) a process_type parameter (set to 'update') that might help. Here's an experiment with it:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
train_idx, test_idx = next(rs.split(X))  # n_splits=1, so take the single split

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

# 'reg:linear' is the deprecated name of 'reg:squarederror'; 'verbosity' replaces 'verbose'
params = {'objective': 'reg:squarederror', 'verbosity': 0}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))  
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train   17.8364309709
model 1      24.2542132108
model 2      25.6967017352
model 1+2    22.8846455135
model 1+update2  14.2816257268
paulperry
  • Which one is the final model or the one I should use? – tumbleweed Mar 28 '17 at 13:37
  • 3
    You want the model with the lowest MSE. But note how the 1+update2 is lower than the full train! It's not clear to me why that should be the case, so I would be suspicious of this result and run a CV with more folds. – paulperry Mar 29 '17 at 14:30
  • 1
    This doesn't seem to work with `'objective': 'binary:logistic'` – Bastiaan Dec 27 '18 at 06:50
  • If I train with two iterations I get an AUC of 0.66 and 0.68 for the successive iterations. Then when training the next minibatch with the exact same data I get the exact same AUCs. I would expect, when continuing on an existing model, that the AUCs would improve further or maybe stay the same. From getting the same AUCs I conclude that it doesn't continue learning; it just starts over. – Bastiaan Dec 27 '18 at 19:14
  • 'process_type': 'update' - gives me error in xgboost 1.9 – Areza May 08 '20 at 19:17
20

I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the Boston dataset to train the model. I ran 3 experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In incremental training, I passed the Boston data to the model in batches of size 50.

The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.

Here is the corresponding code for doing iterative incremental learning with xgboost.

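import sklearn.metrics
import xgboost as xgb

# x_tr, y_tr, x_te, y_te: train/test arrays prepared earlier in the notebook
batch_size = 50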
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',  # the tree updater parameter is named 'updater', not 'update'
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))

XGBoost version: 0.6

Shubham Chaudhary
  • Thanks for the notebook. I wonder if it makes sense to increment only if the model performance is above a threshold in the inner for loop, so that we don't increment at every step but only at the steps where the model has done well? Also, what is your strategy for hyperparameter tuning? – Areza Jun 12 '20 at 07:23
13

It looks like you don't need anything other than calling xgb.train(...) again, passing in the model returned from the previous batch:

# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1

This is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html

Mobigital
  • 2
    imho training continuation is not equivalent to incremental learning. Think training iteration vs online learning. – shamalaia Aug 25 '22 at 06:39
4

If your problem is the dataset size and you do not really need incremental learning (you are not dealing with a streaming app, for instance), then you should check out Spark or Flink.

These two frameworks can train on very large datasets with little RAM by leveraging disk storage. Both frameworks deal with memory issues internally. While Flink had it solved first, Spark has caught up in recent releases.

Take a look at:

  • Both your links redirect to "https://softcloudtech.com/cloud-computing/" for some reason (at least for me). I think the correct ones are https://xgboost.ai/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html and https://xgboost.ai/2016/10/26/a-full-integration-of-xgboost-and-spark.html#:~:text=The%20integrations%20with%20Spark%2FFlink,widely%2Ddeployed%20frameworks%20like%20Spark. Also, thanks for the suggestion; I need to train on a large number of features output from a CNN, so this helps! – Phillip Maire Feb 04 '22 at 17:30
0

In paulperry's code, if you change one line from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50", the MSE of model 1+update2 changes from 14.2816257268 to 45.60806270012028, and a lot of "leaf=0" entries show up in the dump file.

The updated model is not good when the update sample set is relatively small. For binary:logistic, the updated model is unusable when the update sample set contains only one class.
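
One possible way to act on this is to skip batches that are too small or that contain a single class before running an update step. The helper below is hypothetical (its name and thresholds are my own choices, not part of xgboost) and only sketches the idea.

```python
from collections import Counter


def batch_is_usable(labels, min_rows=100, n_classes=2, min_per_class=1):
    """Hypothetical guard: True if a batch looks safe to use for an update step.

    The thresholds are arbitrary placeholders and should be tuned per problem.
    """
    labels = list(labels)
    if len(labels) < min_rows:
        return False  # too few rows to refresh leaf values reliably
    counts = Counter(labels)
    if len(counts) < n_classes:
        return False  # e.g. only one class present for binary:logistic
    return min(counts.values()) >= min_per_class


print(batch_is_usable([0, 1] * 100))  # balanced batch with 200 rows
print(batch_is_usable([1] * 200))     # single-class batch
print(batch_is_usable([0, 1] * 10))   # only 20 rows
```

In a batch loop, a check like this would decide whether to call xgb.train on the batch or to merge it with the next one first.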

Tao Cheng
0

One possible solution that I have not tested is to use a dask dataframe, which should act the same as a pandas dataframe but (I assume) utilizes disk and reads data in and out of RAM. Dask can also be used together with xgboost. Further, there is an experimental option from XGBoost itself, but it is "not ready for production".

Phillip Maire
0

It's not based on xgboost, but there is a C++ incremental decision tree: see gaenari.

Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

greenfish
0

I agree with @desertnaut's solution.

I have a dataset that I split into 4 batches. I have to do an initial fit without the xgb_model parameter first; the following fits then get the xgb_model parameter, like this (I'm using the Sklearn API):

for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}',end = ' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400, xgb_model=model_xgbc)
            
    preds = model_xgbc.predict(self.X_valid)
    
    rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)
J R
-1
Hey guys, you can use my simple code for incremental model training with the xgb base class:

    import xgboost as xgb

    batch_size = 10000000

    # X_train / y_train: your pandas training DataFrame and labels
    # X_valid / y_valid: a held-out validation set
    n = len(X_train)

    # Store eval results
    evals_result = {}
    Deval = xgb.DMatrix(X_valid, y_valid)
    model = None  # no previous model before the first batch
    for start in range(0, n, batch_size):
        # Train on the current batch only, continuing from the previous model
        Dtrain = xgb.DMatrix(X_train.iloc[start:start + batch_size],
                             y_train.iloc[start:start + batch_size])
        model = xgb.train({'refresh_leaf': True,
                           'process_type': 'default',
                           'max_depth': 5,
                           'objective': 'reg:squarederror',
                           'num_parallel_tree': 2,
                           'learning_rate': 0.05,
                           'n_jobs': -1},
                          dtrain=Dtrain,
                          evals=[(Dtrain, 'train'), (Deval, 'eval')],
                          early_stopping_rounds=5,
                          num_boost_round=100,
                          evals_result=evals_result,
                          xgb_model=model)