I applied the predict() function in R to a linear model using the test set, but got an error saying the variables found have more rows.
In my original dataset, the training set has 55 variables (45 binary and 10 numerical) and the test set has 52 (45 binary and 7 numerical). The training set covers the first 20 days of the month and the test set covers the last 10.
Purpose: I am trying to predict a variable for the period covered by the test set; its values are available only in the training set.
I fit the training dataset using the lm() function and predicted values for the test set using predict(). The error occurs because the training data has more observations and variables than the test data.
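A quick way to see the mismatch is to compare the two data frames directly. The sketch below uses small made-up data frames called train and test (hypothetical names and values, not my real data) and lists the columns that exist in the training set but not in the test set:

```r
# Small made-up stand-ins for the training and test sets (hypothetical data)
set.seed(1)
train <- data.frame(
  Temp       = rnorm(6),
  registered = rnorm(6),
  casual     = rnorm(6)
)
test <- data.frame(Temp = rnorm(4))

# Compare the number of rows and columns in each set
dim(train)
dim(test)

# Columns present in the training set but absent from the test set
missing_in_test <- setdiff(names(train), names(test))
missing_in_test
```

Any predictor listed in missing_in_test other than the response itself is one that predict() will not be able to find in the test set.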
Here’s a reproducible example that produces the same error:
> #Training set
> Year <- c(0,1)
> set.seed(1)
> Year2011 <- sample(Year, size = 3000, replace = TRUE)
> Year2011 <- as.integer(Year2011)
> set.seed(3)
> Year2012 <- sample(Year, size = 3000, replace = TRUE)
> Year2012 <- as.integer(Year2012)
> Temp <- rnorm(3000, mean = 2, sd = 1)
> casual <- rnorm(3000, mean = 4, sd = 1)
> registered <- rnorm(3000, mean = 10, sd = 5)
> b <- data.frame(Year2011, Year2012, casual, Temp, registered)
EDIT: I made the columns have the same names in both the test and training sets, but got a new error.
EDIT 2: I added a vector called registered with NA values to the test set a, but got the same error.
Solution: I added a vector of 0 values called registered to the test set a.
> #Test set
> set.seed(4)
> Year2011 <- sample(Year, size = 1000, replace = TRUE)
> Year2011 <- as.integer(Year2011)
> set.seed(5)
> Year2012 <- sample(Year, size = 1000, replace = TRUE)
> Year2012 <- as.integer(Year2012)
> Temp <- rnorm(1000, mean = 20, sd = 5)
> a <- data.frame(Year2011, Year2012, Temp)
> #Add blank vector registered to dataset
> a$registered <- c(0)
> #Fit linear model for variable casual in training set
> mod <- lm(casual ~ ., data = b)
> #Predict variable casual and insert in test set
> a$casual <- predict(mod, a)
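Note that with a column of zeros, every prediction is computed as if registered were 0. A possible alternative, sketched here under the assumption that it is acceptable to leave out predictors the test set does not have, is to fit the model only on the columns both sets share. The data frames train and test and the object names shared, form and mod_shared are made up for illustration:

```r
# Made-up training and test sets (hypothetical data, smaller than the real ones)
set.seed(42)
train <- data.frame(
  Year2011   = sample(0:1, 50, replace = TRUE),
  Temp       = rnorm(50, mean = 2, sd = 1),
  registered = rnorm(50, mean = 10, sd = 5),
  casual     = rnorm(50, mean = 4, sd = 1)
)
test <- data.frame(
  Year2011 = sample(0:1, 20, replace = TRUE),
  Temp     = rnorm(20, mean = 2, sd = 1)
)

# Predictors that exist in both data frames (the response is excluded)
shared <- intersect(setdiff(names(train), "casual"), names(test))

# Build the formula casual ~ Year2011 + Temp from the shared names
form <- reformulate(shared, response = "casual")
mod_shared <- lm(form, data = train)

# predict() now finds every variable it needs in the test set
test$casual <- predict(mod_shared, newdata = test)
length(test$casual)
```

This gives one prediction per test row without a placeholder column, but it is a different model from casual ~ . because registered no longer contributes to the fit.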
I saw a similar post here, but the OP had problems with renaming variables. My problem is different because I am trying to create a new column with the predicted values.