
I'm developing a CTR prediction model for the Kaggle competition (link). I read in the first 100,000 lines of the training set, then split them 80/20 into train and test sets:

library(caret)  # provides createDataPartition()

ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]

Then I used the ad_train data to fit a GLM:

ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)

But whenever I use the predict function to see how well the model does on the ad_test set, I get this warning:

test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows 

What gives? How do I test my GLM model on new data?

EDIT: It works perfectly now. I just needed to make this call instead:

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)

1 Answer


This is happening because you are including the name of the data frame for each variable in the model formula. Instead, your formula should be:

glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)

As described in the second link in the duplicate notification:

This is a problem of using different names between your data and your newdata and not a problem between using vectors or dataframes.

When you fit a model with the lm function and then use predict to make predictions, predict tries to find the same names on your newdata. In your first case name x conflicts with mtcars$wt and hence you get the warning.
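To make the name lookup concrete, here is a minimal sketch on the built-in mtcars data rather than the Kaggle data. The `train`/`test` split and the `am ~ wt` model are assumptions chosen only for illustration; the point is the difference between hard-coding `train$...` in the formula and using bare column names.

```r
# Illustrative split of the built-in mtcars data (not the Kaggle data).
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Problematic: the formula hard-codes train$am and train$wt, so predict()
# re-evaluates those vectors and cannot substitute the columns of newdata.
bad_fit  <- glm(train$am ~ train$wt, family = binomial(link = "logit"), data = train)
# Raises the "'newdata' had 8 rows but variables found have 24 rows" warning,
# silenced here so the comparison below stays readable.
bad_pred <- suppressWarnings(predict(bad_fit, newdata = test, type = "response"))
length(bad_pred)   # 24: one prediction per training row, newdata was ignored

# Fixed: bare column names are looked up in `data` when fitting
# and in `newdata` when predicting.
good_fit  <- glm(am ~ wt, family = binomial(link = "logit"), data = train)
good_pred <- predict(good_fit, newdata = test, type = "response")
length(good_pred)  # 8: one predicted probability per test row
```

With the bare-name formula, the returned vector lines up row for row with `test`, so it can be compared directly against `test$am`.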

Community
eipi10