
I applied the predict() function in R to a linear model using the test set, but got an error saying the variables found have more rows.

In my original dataset, the training set has 55 variables (45 binary and 10 numerical) and the test set has 52 (45 binary and 7 numerical). The training set covers the first 20 days of the month and the test set covers the last 10.

Purpose: I am trying to predict a variable that is available in the training set but missing from the test set.

I fit a model to the training dataset using the lm() function and predicted values for the test set using predict(). I suspect the error occurs because the training data has more observations and variables than the test data.
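One way to check which predictors the fitted model expects but the test set lacks is sketched below. The data frames `train` and `test` and their sizes are small hypothetical stand-ins, not my real data; only the column names follow the example in this question.

```r
# Hypothetical stand-ins for the training and test sets (sizes are arbitrary)
set.seed(1)
train <- data.frame(
  Year2011   = sample(0:1, 200, replace = TRUE),
  Year2012   = sample(0:1, 200, replace = TRUE),
  Temp       = rnorm(200, mean = 2, sd = 1),
  registered = rnorm(200, mean = 10, sd = 5),
  casual     = rnorm(200, mean = 4, sd = 1)
)
test <- data.frame(
  Year2011 = sample(0:1, 50, replace = TRUE),
  Year2012 = sample(0:1, 50, replace = TRUE),
  Temp     = rnorm(50, mean = 2, sd = 1)
)

# Fit on every column except the response
mod <- lm(casual ~ ., data = train)

# Predictors the fitted model expects (response dropped)
needed <- all.vars(delete.response(terms(mod)))

# Predictors that are absent from the test set
missing_vars <- setdiff(needed, names(test))
print(missing_vars)
```

Any name that shows up in `missing_vars` has to be present in the data frame passed to predict(), regardless of how many rows each set has.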

Here’s a reproducible example getting the same error:

> #Training set
> Year <- c(0,1)
> set.seed(1)
> Year2011 <- sample(Year, size = 3000, replace = TRUE)
> Year2011 <- as.integer(Year2011)
> set.seed(3)
> Year2012 <- sample(Year, size = 3000, replace = TRUE)
> Year2012 <- as.integer(Year2012)
> Temp <- rnorm(3000, mean = 2, sd = 1)
> casual <- rnorm(3000, mean = 4, sd = 1) 
> registered <- rnorm(3000, mean = 10, sd = 5) 
> b <- data.frame(Year2011, Year2012, casual, Temp, registered)

EDIT: I made the columns have the same names in both the test and training sets, but got a new error.

EDIT 2: I added a vector in b called registered with NA values, but got the same error.

Solution: I added a vector of 0 values called registered to the test set a.

> #Test set
> set.seed(4)
> Year2011 <- sample(Year, size = 1000, replace = TRUE)
> Year2011 <- as.integer(Year2011)
> set.seed(5)
> Year2012 <- sample(Year, size = 1000, replace = TRUE)
> Year2012 <- as.integer(Year2012)
> Temp <- rnorm(1000, mean = 20, sd = 5)
> a <- data.frame(Year2011, Year2012, Temp)  
> #Add blank vector registered to dataset
> a$registered <- c(0)
> #Fit linear model for variable casual in training set
> mod <- lm(casual ~ ., data = b)
> #Predict variable casual and insert in test set
> a$casual <- predict(mod, a)
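Filling registered with 0 makes predict() run, but every prediction is then computed as if registered were 0. An alternative, sketched here under the assumption that the model may be refit, is to use only the predictors present in both sets. The names `train`, `test`, `shared`, and `mod_shared` are hypothetical and the data is simulated for illustration.

```r
# Hypothetical stand-ins for the training and test sets
set.seed(42)
train <- data.frame(
  Year2011   = sample(0:1, 200, replace = TRUE),
  Year2012   = sample(0:1, 200, replace = TRUE),
  Temp       = rnorm(200, mean = 2, sd = 1),
  registered = rnorm(200, mean = 10, sd = 5),
  casual     = rnorm(200, mean = 4, sd = 1)
)
test <- data.frame(
  Year2011 = sample(0:1, 50, replace = TRUE),
  Year2012 = sample(0:1, 50, replace = TRUE),
  Temp     = rnorm(50, mean = 2, sd = 1)
)

# Keep only predictors that exist in both data frames (response excluded)
shared <- intersect(setdiff(names(train), "casual"), names(test))

# Build the formula casual ~ Year2011 + Year2012 + Temp and refit
mod_shared <- lm(reformulate(shared, response = "casual"), data = train)

# Predict and store the result as a new column in the test set
test$casual <- predict(mod_shared, newdata = test)
head(test)
```

Because the model no longer depends on registered, no placeholder column is needed in the test set.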

I saw a similar post here, but the OP had problems with renaming variables. My problem is different because I am trying to create a new column with the predicted values.

Scott Davis
  • The column names that you used in your model must match the column names you use in your newdata. Here `a` and `b` have different column names (i.e. `Year2011` vs `Year2011s`) – MrFlick May 23 '15 at 17:48
  • I made the edit @MrFlick but got an error about the variable lengths being different. – Scott Davis May 23 '15 at 18:15
  • Now `b` has a column named `registered` that you used when you fit your model, but `a` does not have that column. You need to make sure the column names that you used in your model are all present in your newdata. – MrFlick May 23 '15 at 18:27
  • The duplicate is about a warning. It can't solve the error of the revised question. To be reopened. – Christophe May 23 '15 at 19:56
  • @MrFlick I tried putting in a new vector called casual using a$casual. I am not sure why that did not input the vector of predicted values into a. For the edit, I tried putting a new vector in b with NA values, but still got an error. – Scott Davis May 24 '15 at 02:27
  • @ScottDavis your error has nothing to do with the `casual` column (that actually does not need to be in `a`). As my last comment said, it has to do with the `registered` variable: it is in `b` and was used to fit the model (i.e. it's listed in `coef(mod)`), but it is not in `a`. You cannot predict values from `a` because you are missing an important covariate. – MrFlick May 24 '15 at 02:37
  • @MrFlick thank you, that fixed the error. I put in the independent variable called registered that was left out of the test dataset. – Scott Davis May 24 '15 at 03:21
