
I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.

I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() command. My results are way worse here, though, and I'm not sure why.

I was hoping someone could take a look at my code and see if there are any glaring errors.

# Load the packages used below (dplyr for select()/%>%, ggplot2 for plotting)
library(dplyr)
library(ggplot2)

# Create training data with and without the dependent variable
# (`data` is my full data frame; `split` is the index of the last training row)
train <- data[1:split, ]
train.treat <- select(train, -c(y))

# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))

# Load the package xgboost
library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat), 
             label = train$y,
             nrounds = 100,
             nfold = 10,
             objective = "reg:linear",
             eta = 0.1,
             max_depth = 6,
             early_stopping_rounds = 10,
             verbose = 0   # silent
)

# Get the evaluation log
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
  summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
            ntrees.test  = which.min(test_rmse_mean))    # find the index of min(test_rmse_mean)


# The number of trees to use, as determined by xgb.cv
ntrees <- 25
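# (A sketch of getting this without hard-coding it, assuming the column names above:
#  ntrees <- which.min(elog$test_rmse_mean))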

# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat), # training data as matrix
                          label = train$y,  # column of outcomes
                          nrounds = ntrees,       # number of trees to build
                          objective = "reg:linear", # objective
                          eta = 0.001,
                          depth = 10,
                          verbose = 0  # silent
)

# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))

# Plot predictions vs actual values of y
ggplot(test, aes(x = pred, y = y)) + 
  geom_point() + 
  geom_abline()

# Calculate RMSE
test %>%
  mutate(residuals = y - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

How does this look?

Also, one thing I don't get about xgboost() is why I have to remove the dependent variable from the dataset passed to the data argument and then supply it again separately through the label argument. Why do we do this?
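To illustrate what I mean, here's the pattern as a minimal sketch (df and y are placeholders, not my real objects):

# Features go in as a numeric matrix, with the outcome passed separately
X <- as.matrix(select(df, -c(y)))
fit <- xgboost(data = X, label = df$y, nrounds = 25, objective = "reg:linear", verbose = 0)

# As far as I understand, the two can also be bundled into an xgb.DMatrix up front
dtrain <- xgb.DMatrix(data = X, label = df$y)
fit2 <- xgboost(data = dtrain, nrounds = 25, objective = "reg:linear", verbose = 0)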


My dataset has 809 observations and 108 independent variables. Here is an arbitrary subset:

structure(list(year = c(2019, 2019, 2019, 2019), ht = c(74, 76, 
74, 73), wt = c(223, 234, 215, 215), age = c(36, 29, 32, 24), 
    gp_l1 = c(16, 16, 11, 14), gp_l2 = c(7, 0, 16, 0), gp_l3 = c(16, 
    15, 16, 0), gs_l1 = c(16, 16, 11, 13), gs_l2 = c(7, 0, 16, 
    0), gs_l3 = c(16, 15, 16, 0), cmp_l1 = c(372, 430, 226, 310
    ), cmp_l2 = c(154, 0, 297, 0), cmp_l3 = c(401, 346, 364, 
    0), att_l1 = c(597, 639, 365, 486), y = c(8, 71.5, 26, 22
    )), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

My RMSE from this xgboost() model is 31.7, whereas my random forest and glmnet models give RMSEs around 13. The benchmark prediction I'm comparing against has an RMSE of 15.5. I don't get why my xgboost() model does so much worse than my random forest and glmnet models.

  • how much data do you have? can you dput the data and make it [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? What are the "decent previous results"? What are the results of this script? – AidanGawronski Feb 06 '20 at 02:58
  • @AidanGawronski see above - thanks! –  Feb 08 '20 at 01:02
