
I am having trouble understanding which datasets (training, validation, and test) should be used for the model selection phase versus the Final Model testing phase. I try to explain it in as much detail as possible below and post reproducible code at the bottom. Thank you for any and all advice / suggestions!

Let's say we use the open "Life Expectancy (WHO)" dataset available on Kaggle to predict the target Life expectancy, using RMSE as our measure of error. (I am asking about the concepts behind CV here rather than targeting the lowest RMSE.) We first partition a training set and a test set, led_train and led_test, from the original dataset led.

Next we create a linear model with y = Life expectancy and x = GDP with data = led_train, and do the same for random forest and kNN models using repeated cross-validation with the caret package. We then run predictions with the newly created models on led_test. The RMSE is calculated with a function of the true vs. predicted values.

I now have test-set RMSEs of Linear Model = 9.81141, Random Forest = 9.828415, and kNN = 8.923281. Based on these values I would obviously select the kNN model as my "Final Model"; however, I am not sure how to test it on new "unseen" data to see how well it actually performs.

Do I need to split led into 3 sets (training, validation, and test), use the validation set for the model selection phase, and save the test set for the "Final Model"? Additionally, if I choose the kNN model, would I change the data argument inside the train function from led_train to led so that it is fit on ALL of the data, after which I use led_test for the prediction? For the Final Model, would I again set trControl and run cross-validation, or is this no longer necessary because it was already done on the training data? Please find my reproducible code posted below (you will have to read in the .csv according to your wd) and thank you again for taking a look!
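
For concreteness, here is a rough sketch of the three-way split I have in mind (the 60/20/20 proportions and the object names led_training, led_validation, and led_final_test are placeholders, and this is separate from the two-way split in my reproducible code below):

# Sketch of a possible three-way split using caret's createDataPartition
set.seed(123, sample.kind = "Rounding")
# Hold out 20% as the final test set
test_idx <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = FALSE)
led_final_test <- led[test_idx, ]
led_remaining  <- led[-test_idx, ]
# Split the remaining 80% into training (75% of it) and validation (25% of it),
# i.e. roughly 60% / 20% of the original data
val_idx <- createDataPartition(y = led_remaining$life_exp, times = 1, p = 0.25, list = FALSE)
led_validation <- led_remaining[val_idx, ]
led_training   <- led_remaining[-val_idx, ]
# Model selection: train candidate models on led_training, compare RMSEs on led_validation,
# then evaluate the single chosen model on led_final_test exactly once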

*The seed is set to 123 for reproducibility and I am running R 3.6.3.

library(pacman)
pacman::p_load(readr, caret, tidyverse, dplyr)

# Download the dataset:
download.file("https://raw.githubusercontent.com/christianmckinnon/StackQ/master/LifeExpectancyData.csv", "LifeExpectancyData.csv")

# Read in the data:
led <- read_csv("LifeExpectancyData.csv")

# Check for NAs
sum(is.na(led))
# Set all NAs to 0
led[is.na(led)] <- 0

# Rename `Life expectancy` to life_exp to avoid using spaces
led <- led %>% rename(life_exp = `Life expectancy`)

# Partition training and test sets
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = FALSE)
led_train <- led[-test_index,]
led_test <- led[test_index,]

# Define RMSE as the error metric
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

# Create a linear model
led_lm <- lm(life_exp ~ GDP, data = led_train)
# Create prediction
lm_preds <- predict(led_lm, led_test)
# Check RMSE
RMSE(led_test$life_exp, lm_preds)
# The linear Model achieves an RMSE of 9.81141

# Create a Random Forest Model with Repeated Cross Validation
led_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                      search = "random")
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(life_exp ~ GDP, data = led_train,
                  method = "rf", ntree = 150, trControl = led_cv,
                  tuneLength = 5, nSamp = 1000, 
                  preProcess = c("center","scale"))
# Create Prediction
rf_preds <- predict(train_rf, led_test)
# Check RMSE
RMSE(led_test$life_exp, rf_preds)
# The rf Model achieves an RMSE of 9.828415

# kNN Model:
knn_cv <- trainControl(method = "repeatedcv", repeats = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(life_exp ~ GDP, method = "knn", data = led_train,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center","scale"))
# Create the Prediction:
knn_preds <- predict(train_knn, led_test)
# Check the RMSE:
RMSE(led_test$life_exp, knn_preds)
# The kNN model achieves the lowest RMSE of 8.923281
  • Please have a look at [Order between using validation, training and test sets](https://stackoverflow.com/questions/54126811/order-between-using-validation-training-and-test-sets) and [Should Cross Validation Score be performed on original or split data?](https://stackoverflow.com/questions/60761775/should-cross-validation-score-be-performed-on-original-or-split-data) – desertnaut Jul 29 '20 at 09:41
  • This example is not reproducible as led is not defined – Robert Wilson Jul 29 '20 at 10:29
  • @RobertWilson You are absolutely correct -- thank you for pointing this out. I have edited the code to read in the csv and hope that it is now reproducible! – Christian McKinnon Jul 29 '20 at 12:23

1 Answer


My approach would be the following. The final model should use all of the data. I am not sure what would motivate not including all data in the final model. You are just throwing away predictive power.

For cross-validation, just split the data into training and test sets. Then choose the modelling method with the best performance, and use that method to create the complete model on all of the data (sketched below).
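
Roughly, reusing the objects from your question (final_knn is a placeholder name; knn_cv and tuneLength = 10 are just taken from your code):

# Selection phase: compare the candidate methods on led_train / led_test as you already do.
# Final model: refit the winning method on the full dataset.
set.seed(123, sample.kind = "Rounding")
final_knn <- train(life_exp ~ GDP, method = "knn", data = led,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center", "scale"))
# Report the test-set RMSE from the selection phase as the performance estimate;
# final_knn is simply the model you would use for new predictions.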

The bigger problem with the current code is that the cross-validation method is likely to result in two things: spurious accuracy and potentially spurious model comparisons. You need to deal with temporal autocorrelation in the cross-validation. For example, if my training dataset has features for the UK for 2014 and 2016, you would expect something like a random forest to be able to predict life expectancy for 2015 with high accuracy, and that is potentially all you are measuring with the current type of cross-validation. It is better to create a segregated dataset so that the countries in training and test are different, or to split the data into clearly distinct time periods (see the sketch below). The exact approach would depend on exactly what you want the model to predict.
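
For example, country-segregated folds could be built with caret's groupKFold. This assumes the data has a Country column, as in the Kaggle file, and it is a sketch rather than a drop-in fix:

# Sketch: build folds so the same country never appears in both the analysis and
# assessment parts of a resample (assumes a Country column exists)
country_folds <- groupKFold(led_train$Country, k = 5)
grouped_cv <- trainControl(method = "cv", index = country_folds)
set.seed(123, sample.kind = "Rounding")
train_knn_grouped <- train(life_exp ~ GDP, method = "knn", data = led_train,
                           tuneLength = 10, trControl = grouped_cv,
                           preProcess = c("center", "scale"))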

Robert Wilson
  • Thank you for your approach @RobertWilson! When you say "the final model should use all of the data," do you mean I should fit the final model like this: ```train_knn <- train(life_exp ~ GDP, method = "knn", data = led, tuneLength = 10, trControl = knn_cv, preProcess = c("center","scale"))``` and then predict again with ```knn_preds <- predict(train_knn, led_test)```? My concern is that `led` already contains the same data as `led_test`, which will lead to a significantly lower RMSE on the Final Model than on the previous model from the selection phase. – Christian McKinnon Jul 29 '20 at 12:33
  • Additionally, you are entirely correct about the inherent flaws of this CV method, though this question is more concerned with which sets (training, validation, or test) should be used during the different phases of modeling. – Christian McKinnon Jul 29 '20 at 12:39
  • It depends on what you are going to use the model for, I guess, which is perhaps not clear in the original post. My view on this is that the RMSE on the final model doesn't matter; it's the RMSE on the train/test model that matters. That gives you an indication of the predictive power of the model, though of course it can be difficult to interpret. You have less data in the training set, so that will result in a model with lower predictive power. But then again, if your CV method is not ideal, you can end up overestimating predictive power. – Robert Wilson Jul 29 '20 at 13:00