
I'm attempting to stack a BERT TensorFlow model with an XGBoost model in Python. To do this, I have trained the BERT model and have a generator that takes the predictions from BERT (which predicts a category) and yields a list, which is the result of categorical data concatenated onto the BERT prediction. This doesn't train, however, because it doesn't have a shape. The code I have is:

...
categorical_inputs=df[cat_cols]
y=pd.get_dummies(df[target_col]).values
xgboost_labels=df[target_col].values
concatenated_text_input=df['concatenated_text']
text_model.fit(tf.constant(concatenated_text_input),tf.constant(y), epochs=8)
cat_text_generator=(list(categorical_inputs.iloc[i].values)+list(text_model.predict([concatenated_text_input.iloc[i]])[0]) for i in range(len(categorical_inputs)))


clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
                       gamma=1)
clf.fit(cat_text_generator, xgboost_labels)

and the error I get is:

...
-> 1153         if len(X.shape) != 2:
   1154             # Simply raise an error here since there might be many
   1155             # different ways of reshaping

AttributeError: 'generator' object has no attribute 'shape'

Although it's possible to create a list or array to hold the data, I would prefer a solution that would work for when there's too much data to hold in memory at once. Is there a way to use generators to train an xgboost model?

DrRaspberry

2 Answers

def generator(X_data, y_data, batch_size):
    while True:
        for step in range(X_data.shape[0] // batch_size):
            start = step * batch_size
            end = (step + 1) * batch_size
            current_x = X_data.iloc[start:end]
            current_y = y_data.iloc[start:end]
            # or, if it's a numpy array, just slice the rows directly
            yield current_x, current_y

batch_size = 32
Generator = generator(X, y, batch_size)
number_of_steps = X.shape[0] // batch_size

clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
                       gamma=1)
 
for step in range(number_of_steps):
    X_g, y_g = next(Generator)
    clf.fit(X_g, y_g)
Hakan Akgün
  • You're welcome :) To show that the question is resolved, don't forget to mark the answer as accepted. – Hakan Akgün Aug 10 '21 at 18:08
  • I am not sure this works as expected, see this post https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost – Phillip Maire Feb 04 '22 at 17:01
  • I couldn't understand why it would reset the params in the for loop (as the link says). I don't think that's the case; I think it just updates, as in any other step-based gradient operation. – Hakan Akgün Feb 05 '22 at 11:09
  • From what I understand, the model doesn't 'reset', but because the model is (iteratively) **trained in full** each time you call `.fit`, the parameters are tuned for that subset. Then, when you call fit again (on the next pass through your for loop), those parameters are 'overwritten', so to speak. Similar to transfer learning, the old parameters are altered. E.g. if trained on the MNIST dataset for digits 0 to 4 on the first pass and 5 to 9 on the second, the final model would do poorly on digits 0 to 4. I want to know for sure if I am correct, so anyone with more info please let me know. – Phillip Maire Feb 06 '22 at 21:36
  • I think what we want to do is transfer learning. Each time we train our model, the parameters will obviously be updated and overwritten; I can't see the problem here. Maybe, to explain your question in more detail, you should open a new question about it. – Hakan Akgün Feb 13 '22 at 19:43
  • Yeah, I think @PhillipMaire is on to something. From the xgboost docs: "Note that calling ``fit()`` multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass ``xgb_model`` argument." See the sketch below. – Henrik Mar 01 '23 at 20:34
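
Following up on the docs caveat in the last comment: to actually continue training across batches with the scikit-learn wrapper, the previous booster has to be passed back in via the `xgb_model` argument of `fit`. A minimal sketch of that pattern (the toy batches here are made up purely for illustration):

import numpy as np
import xgboost as xgb

# Hypothetical toy batches, standing in for chunks streamed from disk.
X_batches = [np.random.rand(64, 10) for _ in range(5)]
y_batches = [np.random.randint(0, 2, 64) for _ in range(5)]

clf = xgb.XGBClassifier(n_estimators=50)
booster = None
for X_b, y_b in zip(X_batches, y_batches):
    # Passing the previous booster resumes training instead of refitting
    # from scratch; on the first batch there is nothing to resume from.
    clf.fit(X_b, y_b, xgb_model=booster)
    booster = clf.get_booster()

Note that each `fit` call appends new trees without revisiting earlier batches, so this is boosting on a stream of data rather than an exact equivalent of training on the full dataset at once.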

You can use DeviceQuantileDMatrix with a custom iterator as input. The iterator must implement xgboost.core.DataIter. Here is an example from the xgboost repo:

https://github.com/dmlc/xgboost/blob/master/demo/guide-python/quantile_data_iterator.py
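
Condensed from that demo, a minimal sketch of such an iterator (this assumes a GPU and cupy, since DeviceQuantileDMatrix is GPU-only; the batch data is synthetic, purely for illustration):

import cupy
import xgboost

class BatchIter(xgboost.core.DataIter):
    def __init__(self, batches):
        # batches: list of (X, y) pairs of cupy arrays
        self._batches = batches
        self._it = 0
        super().__init__()

    def next(self, input_data):
        # input_data is a callback supplied by XGBoost; return 0 when
        # the iterator is exhausted, 1 otherwise.
        if self._it == len(self._batches):
            return 0
        X, y = self._batches[self._it]
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        # Called by XGBoost before each new pass over the data.
        self._it = 0

batches = [(cupy.random.rand(128, 8), cupy.random.randint(0, 2, size=128))
           for _ in range(4)]
m = xgboost.DeviceQuantileDMatrix(BatchIter(batches))
booster = xgboost.train({'tree_method': 'gpu_hist'}, m, num_boost_round=50)

Because the quantile sketch is built batch by batch, the full dataset never has to sit in memory at once, which matches the constraint in the question.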

Mutlu Simsek