
I would like to ask for help, please. I use this code to run an XGBoost model with the caret package. However, I want to use a validation split based on time: 60% training, 20% validation, 20% testing. I have already split the data (a sketch of my split is below), but I do not know how to deal with the validation data when it is not cross-validation.

Thank you,
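
Here is roughly how I split the data (a simplified sketch; `trans` is my dataset, already ordered by time):

# Chronological 60/20/20 split (assumes `trans` is sorted by time)
n <- nrow(trans)
trans_train    <- trans[1:floor(0.6 * n), ]
trans_validate <- trans[(floor(0.6 * n) + 1):floor(0.8 * n), ]
trans_test     <- trans[(floor(0.8 * n) + 1):n, ]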

xgb_trainControl <- trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)

# Single-point "grid" (one value per hyperparameter)
xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1)

set.seed(123)
xgb1 <- train(sale ~ ., data = trans_train,
              trControl = xgb_trainControl,
              tuneGrid = xgb_grid,
              method = "xgbTree")
xgb1

# Predict on the held-out test set
pred <- predict(xgb1, trans_test)
  • Can you please clarify what you mean by "I want to use the validation split based on time"? Also please update your example to include the code you have used to split the data (e.g. `createDataPartition()`) – jared_mamrot Aug 05 '20 at 00:16
  • Thank you for your quick reply! I usually use this code, which uses cross-validation to validate the model and the tuning parameters with an 80% training / 20% testing split. However, I split the data by time (not randomly) and I need to use a validation dataset to avoid overfitting. I now have three datasets (60%, 20%, 20%), but I do not know how to use the validation dataset in the model. – AAA Aug 05 '20 at 00:21

1 Answer


The validation partition should not be used when you are creating the model. It should be 'set aside' until the model has been trained and tuned using the 'training' and 'tuning' partitions; then you can apply the model to predict the outcomes of the validation dataset and summarise how accurate those predictions were.

For example, in my own work I create three partitions: training (75%), tuning (10%), and testing/validation (15%), using:

# Load caret for createDataPartition()
library(caret)

# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75,
                                  list = FALSE,
                                  times = 1)

# Split the dataset using the defined partition
train_data <- data[trainIndex, , drop = FALSE]
tune_plus_val_data <- data[-trainIndex, , drop = FALSE]

# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)

# Split the remaining ~25% of the data: 40% (tune) and 60% (validation)
tune_data <- tune_plus_val_data[-tune_plus_val_index, , drop = FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, , drop = FALSE]

# Outcome of this section is that the data (100%) is split into:
# training (~75%)
# tuning (~10%)
# validation (~15%)

These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the 'training' partition to train models and the 'tuning' partition to tune hyperparameters (e.g. random grid search) and evaluate model training (e.g. cross validation). This is ~equivalent to the code in your question.
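
The conversion step might look something like this (a sketch, assuming the outcome column is named `response`, the remaining columns are numeric features, and the small `features()` helper is introduced here just for brevity):

library(xgboost)

# Convert each partition into an xgb.DMatrix: features as a numeric
# matrix, outcome as the label (column names are assumptions)
features <- function(df) as.matrix(df[, names(df) != "response"])
dtrain <- xgb.DMatrix(features(train_data), label = train_data$response)
dtune  <- xgb.DMatrix(features(tune_data),  label = tune_data$response)
dval   <- xgb.DMatrix(features(val_data),   label = val_data$response)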

# `lrn` and `mytune` come from an earlier mlr hyperparameter search (not shown)
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)

# Collect the tuned hyperparameters into a parameter list for xgb.train()
params2 <- list(booster = "gbtree",
                objective = lrn_tune$par.vals$objective,
                eta = lrn_tune$par.vals$eta, gamma = 0,
                max_depth = lrn_tune$par.vals$max_depth,
                min_child_weight = lrn_tune$par.vals$min_child_weight,
                subsample = 0.8,
                colsample_bytree = lrn_tune$par.vals$colsample_bytree)

# Train on dtrain, monitoring performance on the tuning set via the watchlist
xgb2 <- xgb.train(params = params2,
                  data = dtrain, nrounds = 50,
                  watchlist = list(val = dtune, train = dtrain),
                  print_every_n = 10, early_stopping_rounds = 50,
                  maximize = FALSE, eval_metric = "error")

Once the model is trained I apply the model to the validation data with predict():

xgbpred2_keep <- predict(xgb2, dval)

# Combine predictions with patient IDs and the true response
xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val_data),
                      "Response" = val_data$response)

# Reorder Patients according to Response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])

library(ggplot2)

ggplot(xg2_val, aes(x = Patient, y = Prediction,
                    fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title = paste("Patient predictions (xgb2) for the validation dataset (n = ",
                     nrow(val_data), ")", sep = ""),
       subtitle = "Above 0.5 = Non-Responder, Below 0.5 = Responder",
       caption = paste("JM", Sys.Date(), sep = " "),
       x = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                   hjust = 1, size = 8)) +
  # Distance from red line = confidence of prediction
  geom_hline(yintercept = 0.5, colour = "red")


# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5, 1, 0)

# Results matrix (true/false positives & negatives); `labels_tv` holds the
# true binary labels for the validation set (defined earlier, not shown)
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))


# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val_data),
                                 label = labels_tv,
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results

This provides you with a summary of how well the model 'works' on your validation data.

jared_mamrot
  • Thank you so much. So in my case I just need to add watchlist = list(val=dtest, train=dtrain)? However, it did not work in my code. This is really my question: how do I use both the training and validation datasets in the code, similar to your watchlist? – AAA Aug 05 '20 at 00:37
  • I can't tell from your code because I don't know what your data partitions are called. If you have 'trans_train', 'trans_test' and 'trans_validate', then `watchlist = list(val=trans_test, train=trans_train)` is appropriate. Then you apply the trained model using `predict(xgb1, trans_validate)` and evaluate the results with e.g. a confusion matrix (see the sketch after these comments). – jared_mamrot Aug 05 '20 at 00:56
  • You also have some issues/errors to address e.g. expand.grid is used incorrectly; this question/answer shows a good example of how to approach this type of task correctly: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees – jared_mamrot Aug 05 '20 at 01:03
  • I can fix the expand.grid, but I still do not know how to use the validation dataset for testing based on your code. I tried to use the watchlist in my code but it did not work. – AAA Aug 05 '20 at 02:03
  • If you update your question to include the code you used to split the data (like I asked in my first comment) it will help me help you. – jared_mamrot Aug 05 '20 at 02:46
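
To make the watchlist suggestion concrete, here is a minimal sketch using the partition names from the question (the hyperparameters are copied from the question's grid; it assumes `sale` is the outcome and all other columns are numeric features):

library(xgboost)

# Build DMatrix objects from the three chronological partitions
dtrain    <- xgb.DMatrix(as.matrix(trans_train[, names(trans_train) != "sale"]),
                         label = trans_train$sale)
dtest     <- xgb.DMatrix(as.matrix(trans_test[, names(trans_test) != "sale"]),
                         label = trans_test$sale)
dvalidate <- xgb.DMatrix(as.matrix(trans_validate[, names(trans_validate) != "sale"]),
                         label = trans_validate$sale)

# Train on the training set; the last watchlist entry (the 20% test/tuning
# set) is monitored for early stopping, so training halts before overfitting
xgb1 <- xgb.train(params = list(booster = "gbtree", eta = 0.01, max_depth = 8,
                                gamma = 1, min_child_weight = 1,
                                subsample = 1, colsample_bytree = 1),
                  data = dtrain, nrounds = 1000,
                  watchlist = list(train = dtrain, val = dtest),
                  early_stopping_rounds = 50, print_every_n = 100)

# Finally, apply the trained model to the held-out validation partition
pred_validate <- predict(xgb1, dvalidate)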