Why does my machine learning model perform poorly with batch training?

Question

My machine learning model (an XGBoost regressor) performs worse when I train it in batches (i.e. epochs > 1). If I set the number of epochs to 1 (i.e. no batches), my model score is near 93%. That's great. However, when I set the number of batches to 25 or 100, the out-of-sample model score deteriorates with each successive epoch. By the last batch, the out-of-sample score is extremely poor and the model cannot predict anything well! Does anyone see an issue with my code below? Thanks in advance!

Edit: genSold is a generator over my entire database.

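# Imports the snippet relies on (shuffle is presumably sklearn.utils.shuffle).
import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# nSold, genSold, processData, numerical_features, categorical_features and
# prediction are defined elsewhere (not shown).
epochs = 100
batchSize = nSold // epochs  # rows per batch
print(batchSize)
print(batchSize * epochs)
model = xgboost.XGBRegressor()
for epoch in range(epochs):
    print(f"Epoch {epoch+1} of {epochs}")
    # Pull the next batchSize rows from the generator.
    data = []
    count = 0
    for item in genSold:
        if count == batchSize:
            break
        data.append(item)
        count += 1
    print(len(data))
    df = shuffle(pd.DataFrame(data))
    df2 = processData(df, numerical_features, categorical_features)
    df2.drop(columns=['house-id_listing'], inplace=True)
    df2 = df2.dropna(subset=prediction)
    Y = df2[prediction]
    X = df2.drop(columns=prediction)
    # Each batch gets its own 90/10 train/test split.
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
    if epoch == 0:
        model.fit(X_train, Y_train)
    else:
        # Continue from the existing booster, using the first batch's columns.
        features = model.get_booster().feature_names
        print(len(features))
        model.fit(X_train[features], Y_train, xgb_model=model.get_booster())
    # Score on this batch's own test split.
    print(model.score(X_test, Y_test))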

eschibli · Accepted Answer · 2022-04-05T18:41:24.917


With the sklearn API, model.fit will not update existing trees; it only fits new ones to the new dataset. You can train an existing model incrementally using the native Python API, but as of 2018 the developers did not recommend batch training. If you must do so, you need to pass over the entire training set at least several times to replicate the performance of training in a single batch.

Edit: The following assumption was wrong, since genSold is a generator.

If I understand your code correctly, you are training on the first batchSize samples in genSold repeatedly, as samples are never removed from genSold.

However, if that were the case, I would expect the score reported in the last line to improve: you shuffle each batch before splitting it into train and test folds, which, after the first batch, should mean you are testing on samples you have already trained on. Do you mean it is performing poorly on a separate hold-out set?
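Whether the same rows come back on every pass depends on what genSold is. The sketch below uses a hypothetical stand-in (`rows`, a list of integers) and a helper, `take_with_break`, shaped like the batching loop in the question. A list restarts from the beginning each time, while a generator keeps its position. It also shows a side effect of breaking after the item has already been pulled: one row per batch is silently dropped, which `itertools.islice` avoids.

```python
from itertools import islice


def take_with_break(source, batch_size):
    """Collect up to batch_size items with a count-and-break loop."""
    data = []
    count = 0
    for item in source:
        if count == batch_size:
            break  # `item` was already pulled from the iterator and is discarded
        data.append(item)
        count += 1
    return data


rows = list(range(10))  # hypothetical stand-in for the database rows

# A list is re-iterated from the start, so every batch is identical.
list_batches = [take_with_break(rows, 3) for _ in range(2)]
print(list_batches)  # [[0, 1, 2], [0, 1, 2]]

# A generator remembers its position, but the break drops one row per batch.
gen = (r for r in rows)
gen_batches = [take_with_break(gen, 3) for _ in range(2)]
print(gen_batches)  # [[0, 1, 2], [4, 5, 6]]

# islice pulls exactly batch_size items and loses nothing.
gen2 = (r for r in rows)
islice_batches = [list(islice(gen2, 3)) for _ in range(2)]
print(islice_batches)  # [[0, 1, 2], [3, 4, 5]]
```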

edited Apr 05 '22 at 18:41

answered Apr 05 '22 at 17:05


@eschibli sorry, I didn’t mention that genSold is a generator over every row of data. – John Doe Apr 05 '22 at 18:20
Am I able to use xgboost to train iteratively? – John Doe Apr 05 '22 at 18:22
It's not recommended, and probably not via the sklearn API. – eschibli Apr 05 '22 at 18:50
