
I have trained an XGBoost classification model for sentiment analysis of product reviews. However, in certain cases the model's predictions are not what I expect. For example, when I input the review "The delivery was a bit late but the product was awesome", the model classifies it as negative (0), but I want to fine-tune the model on that exact case so that it labels the review as positive (1).

Is there a way to fine-tune the already trained XGBoost model by adding specific data points like this? What would be the best approach to achieve this without retraining the whole model from scratch?

I've tried the following function:

# Fine-tune the model (my current attempt)
import numpy as np

def fine_tune(model, inp, output, word2vec):
    # Embed the review with word2vec, then fit on that single example
    model.fit(
        np.array([word2vec.get_mean_vector(tokenize(inp))]),
        np.array([output])
    )
    return model

However, when I run it, it retrains the whole model from scratch on the single data point I pass in.
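Digging into the docs, I think the reason is that the sklearn-style fit() always builds a new booster unless you hand it the previous one via its xgb_model argument. A rough sketch of what I mean (untested, reusing my word2vec/tokenize helpers from above, and fine_tune_continuing is just a name I made up), though I'm not sure this is the right approach:

# Sketch: fit() only continues from an existing model when that model is
# passed explicitly via the xgb_model argument.
import numpy as np

def fine_tune_continuing(model, inp, output, word2vec):
    X = np.array([word2vec.get_mean_vector(tokenize(inp))])
    y = np.array([output])
    # Pass the current booster so training continues instead of restarting
    model.fit(X, y, xgb_model=model.get_booster())
    return model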

Any guidance or suggestions would be greatly appreciated. Thank you!

Chris
    What you're looking for is called incremental learning; see [here](https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost) for an example. – Laassairi Abdellah May 06 '23 at 16:42

1 Answer


Thanks to @Laassairi Abdellah, who redirected me to incremental training. Armed with that knowledge, I've made this function:

import xgboost as xgb
import numpy as np

def fine_tune(model_, X, y, loop=False, num_boost_rounds=30, params=None):
    """
    Fine-tune an XGBoost model using incremental training.

    Args:
    - model_: str, xgboost.core.Booster, or sklearn-wrapper model; path to a
      saved model or the model object to be fine-tuned.
    - X: array-like, shape (n_samples, n_features), input data for training.
    - y: array-like, shape (n_samples,), output (target) data for training.
    - loop: bool, repeat training until the model predicts y perfectly.
    - num_boost_rounds: int, number of boosting rounds per training call.
    - params: dict, booster parameters; required when model_ is a path or a Booster.

    Returns:
    - model: the fine-tuned XGBoost Booster.
    """

    if isinstance(model_, str):
        # Load the existing model from disk
        model = xgb.Booster()
        model.load_model(model_)
    elif isinstance(model_, xgb.Booster):
        # Already a low-level Booster; train on it directly
        model = model_
    else:
        # Assume a scikit-learn wrapper (e.g. XGBClassifier) and extract its Booster
        try:
            model = model_.get_booster()
        except AttributeError:
            raise ValueError("The model must be a path, a Booster, or a fitted XGBoost model.")

    if isinstance(model_, (xgb.Booster, str)):
        assert params is not None, \
            "The params argument must be provided when model_ is a file path or a Booster."

    # get_xgb_params() returns booster-style parameters from the sklearn wrapper
    param = params if params is not None else model_.get_xgb_params()

    # Convert the input to DMatrix
    dX = xgb.DMatrix(X, label=y)

    # Continue training from the existing booster instead of starting from scratch
    model = xgb.train(param, dX, num_boost_rounds, xgb_model=model)

    if loop:
        # Keep boosting until the model classifies every sample in X correctly
        while True:
            y_pred = model.predict(dX)
            y_pred = np.where(y_pred > 0.5, 1, 0)

            if np.all(y_pred == y):
                break

            model = xgb.train(param, dX, num_boost_rounds, xgb_model=model)

    if not isinstance(model_, (str, xgb.Booster)):
        # Update the wrapper's internal booster so model_ reflects the new trees
        model_._Booster = model

    return model

The loop section of this code is specific to my binary-classification use case, where the label is either 1 or 0.
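If your model were multiclass instead, one way to adapt the stopping check is a sketch like the following; predictions_match is just an illustrative name, and it assumes a softprob-style objective so that predict returns one probability per class:

import numpy as np

def predictions_match(model, dX, y):
    # Works for binary (1-D probabilities) and multi:softprob (2-D) output
    raw = model.predict(dX)
    if raw.ndim == 2:
        y_pred = np.argmax(raw, axis=1)   # pick the most probable class
    else:
        y_pred = np.where(raw > 0.5, 1, 0)
    return np.all(y_pred == y)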

Example usage:

fine_tune(
    model,
    np.array([word2vec.get_mean_vector(tokenize(
        "The delivery was a tiny bit late but the product was sleek and high quality"
    ))]),
    np.array([1]),
    loop=True
)
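
To sanity-check that the fine-tuning took, you can run the same embedded review back through the updated wrapper (this assumes model is the sklearn-style classifier from the question, whose internal booster fine_tune replaces):

vec = np.array([word2vec.get_mean_vector(tokenize(
    "The delivery was a tiny bit late but the product was sleek and high quality"
))])
print(model.predict(vec))  # expected: [1] after fine-tuning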
Chris