R: implementing my own gradient boosting algorithm

Question

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.

I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so my corresponding gradient is just the residual y - yhat. I'm initializing the predictions at 0.

library(rpart)
data(iris)

#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}

mod <- list()

grad_boost <- function(data, learning.rate, M, grad.fun) {
  # Initialize fit to be 0
  fit <- rep(0, nrow(data))
  grad <- grad.fun(y = data$Sepal.Length, yhat = fit)

  # Initialize model
  mod[[1]] <- fit

  # Loop over a total of M iterations
  for(i in 1:M){

    # Fit base learner (tree) to the gradient
    tmp <- data$Sepal.Length
    data$Sepal.Length <- grad
    base_learner <- rpart(Sepal.Length ~ ., data = data, control = ("maxdepth = 2"))
    data$Sepal.Length <- tmp

    # Fitted values by fitting current model
    fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))

    # Update gradient
    grad <- grad.fun(y = data$Sepal.Length, yhat = fit)

    # Store current model (index is i + 1 because i = 1 contain the initialized estiamtes)
    mod[[i + 1]] <- base_learner

  }
  return(mod)
}

With this, I split up the iris data set into a training and testing data set and fit my model to it.

train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)

Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.

yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)

# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972

I have a few questions

Does my gradient boosting algorithm look right?
Did I calculate the predicted values yhats.mymod correctly?

mcskinner · Accepted Answer · 2020-04-11T22:52:45.357

Yes this looks correct. At each step you are fitting to the psuedo-residuals, which are computed as the derivative of loss with respect to the fit. You have correctly derived this gradient at the start of your question, and even bothered to get the factor of 2 right.
This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.

But to address something that was not asked, I noticed that your training setup has a few quirks.

The iris dataset is split equally between 3 species (setosa, versicolor, virginica) and these are adjacent in the data. Your training data has all of the setosa and versicolor, while the test set has all of the virginica examples. There is no overlap, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this.
The combination of learning rate and model count looks too low to me. The fit converges as (1-lr)^n. With lr = 1e-3 and n = 1000 you can only model 63.2% of the data magnitude. That is, even if every model predicts every sample correctly, you would be estimating 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help since then the effect is a regression to the mean instead of just a drag.

Thank you for your comments. Could you expand on why the "fit converges as (1-lr)^n"? What is the rationale behind this? — Adrian, Apr 26 '20 at 02:29
It is because `fit <- fit + learning.rate * prediction`, where `prediction` is the residual `target - fit`. So `fit <- fit + lr * (target - fit)`, or `fit <- fit * (1 - lr) + target * lr`. This is just an exponential moving average. Per [Wikipedia](https://en.wikipedia.org/wiki/Moving_average#Approximating_the_EMA_with_a_limited_number_of_terms), "the weight omitted by stopping after k terms is `(1-α)^k` out of the total weight" (`α` is the learning rate and `k` is `n`). You are starting with an estimate of 0 instead of the mean, so this weight omitted comes directly out of the prediction. — mcskinner, Apr 26 '20 at 06:24

R: implementing my own gradient boosting algorithm

1 Answers1