My machine learning model (an xgboost regressor) seems to perform worse when I train it in batches (i.e. epochs > 1). If I set the number of epochs to 1 (i.e. no batches), my model score is near 93%. That's great. However, when I set the number of batches to 25 or 100, the out-of-sample score gets steadily worse as the number of epochs increases. By the last batch, the out-of-sample score is extremely poor and the model cannot predict anything well! Does anyone see an issue with my code below? Thanks in advance!
Edit: genSold is a generator over my entire database.
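import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle  # assuming shuffle is sklearn's

# nSold, genSold, processData, numerical_features, categorical_features
# and prediction are defined elsewhere in my code.
epochs = 100
batchSize = nSold // epochs
print(batchSize)
print(batchSize * epochs)
model = xgboost.XGBRegressor()
for epoch in range(epochs):
    print(f"Epoch {epoch+1} of {epochs}")
    # Read the next batch of rows from the generator
    data = []
    count = 0
    for item in genSold:
        if count == batchSize:
            break
        data.append(item)
        count += 1
    print(len(data))
    # Preprocess the batch and split off the target column
    df = shuffle(pd.DataFrame(data))
    df2 = processData(df, numerical_features, categorical_features)
    df2.drop(columns=['house-id_listing'], inplace=True)
    df2 = df2.dropna(subset=prediction)
    Y = df2[prediction]
    X = df2.drop(columns=prediction)
    # Each batch gets its own 90/10 train/test split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
    if epoch == 0:
        model.fit(X_train, Y_train)
    else:
        # Continue training from the booster fitted on the previous batches
        features = model.get_booster().feature_names
        print(len(features))
        model.fit(X_train[features], Y_train, xgb_model=model.get_booster())
    print(model.score(X_test, Y_test))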
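
A side detail about the batching loop, shown as a minimal sketch rather than a diagnosis: the `count == batchSize` check runs only after the `for` statement has already pulled the next item from the generator, so that item is discarded at every batch boundary. The sketch below uses a stand-in generator (`range(10)` instead of `genSold`), and both function names are hypothetical. It compares the loop from the post with `itertools.islice`, which reads exactly the requested number of items. This may well be unrelated to the score drop.

```python
from itertools import islice


def read_batch_with_break(gen, batch_size):
    """Same logic as the loop in the post: pull an item, then check the count."""
    data = []
    count = 0
    for item in gen:
        if count == batch_size:
            # `item` has already been taken from the generator and is lost here
            break
        data.append(item)
        count += 1
    return data


def read_batch_with_islice(gen, batch_size):
    """Read at most batch_size items without consuming an extra one."""
    return list(islice(gen, batch_size))


# Stand-in for genSold: a one-pass iterator over ten rows
gen_a = iter(range(10))
batches_break = [read_batch_with_break(gen_a, 3) for _ in range(3)]

gen_b = iter(range(10))
batches_islice = [read_batch_with_islice(gen_b, 3) for _ in range(3)]

print(batches_break)   # rows 3 and 7 never appear in any batch
print(batches_islice)  # consecutive rows, none skipped
```

With 100 batches this drops about one row per batch, so it is probably a small effect compared with whatever is causing the score to collapse.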