I'm developing a CTR prediction model for the Kaggle competition (link). I've read in the first 100,000 rows of the training set, then split them 80/20 into train and test sets with caret:
library(caret)  # createDataPartition() comes from caret
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p = 0.8, list = FALSE, times = 1)  # stratified 80/20 split on click
ad_train <- ad_data[trainIndex, ]
ad_test  <- ad_data[-trainIndex, ]
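Since createDataPartition stratifies on the outcome, the click rate should come out roughly equal in both pieces. A quick sanity check (assuming click is coded 0/1, as in the raw file):
mean(ad_train$click)  # CTR in the training split
mean(ad_test$click)   # should be close to the training CTR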
Then I used ad_train to fit a logistic regression GLM:
ad_glm_model <- glm(ad_train$click ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)
But whenever I try to use predict() to check how well the model does on ad_test, I get a warning:
test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows
What gives? How do I test my GLM model on new data?
EDIT: Solved. The problem was that the formula referred to the ad_train$... vectors directly, so predict() kept using those 80,000 training values and ignored newdata. Write the formula against the column names and let data = ad_train supply them:
ad_glm_model <- glm(click ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
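With the model fit this way, the original predict() call returns one probability per row of ad_test. A minimal sketch of the evaluation step; the 0.5 cutoff and the hand-rolled log-loss line are just illustrative choices, not part of the original code:
test_probs <- predict(ad_glm_model, newdata = ad_test, type = "response")  # predicted click probabilities, one per ad_test row
table(predicted = as.integer(test_probs > 0.5), actual = ad_test$click)    # rough confusion table at a 0.5 cutoff
-mean(ad_test$click * log(test_probs) + (1 - ad_test$click) * log(1 - test_probs))  # log loss on the held-out 20%
One thing to watch with only 100,000 rows: character predictors like site_category become factors inside glm(), so if ad_test contains a level that never appears in ad_train, predict() will stop with a "factor has new levels" error.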