2

I am building a linear regression on the cafe dataset and I want to validate the results by calculationg Leave-One-Out CrossValidation.

I wrote my own function for that, which works if I fit lm() on all data, but when I am using subset of columns (from Stepwise regression), I am getting an error. Consider the following code:

cafe <- read.table("C:/.../cafedata.txt", header=T)

cafe$Date <- as.Date(cafe$Date, format="%d/%m/%Y")

#Delete row 34
cafe <- cafe[-c(34), ]
#wont need date
cafe <- cafe[,-1]

library(DAAG)
#center the data
cafe.c <- data.frame(lapply(cafe[,2:15], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales

#Leave-One-Out CrossValidation function
LOOCV <- function(fit, dataset){
  # Attributes:
  #------------------------------
  # fit: Fit of the model
  # dataset: Dataset to be used
  # -----------------------------
  # Returns mean of squared errors for each fold - MSE
  MSEP_=c()

  for (idx in 1:nrow(dataset)){
    train <- dataset[-c(idx),]
    test <- dataset[idx,]

    MSEP_[idx]<-(predict(fit, newdata = test) -  dataset[idx,]$Sales)^2
  }
  return(mean(MSEP_))
}

Then when I fit the simple linear model and call the function, it works:

#----------------Simple Linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~., data=cafe.c)

#MSE:258
LOOCV(fit.all.c, cafe.c)

However when I fit the same lm() only with subset of columns, I get an error:

#-------------------------Linear Stepwise regression--------------------------
step <- stepAIC(fit.all.c, direction="both")

fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
               + cafe.c$Wraps.Waste + cafe.c$Muffins.Sold 
               + cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold 
               + cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees 
               + cafe.c$Day.of.Week,data=cafe.c)

LOOCV(fit.step, cafe.c)

5495.069

There were 50 or more warnings (use warnings() to see the first 50)

If I look closer:

idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]

(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2

I get MSE for all rows and an error:

Warning message: 'newdata' had 1 row but variables found have 47 rows

EDIT

I have found this question about the error, which says that this error occurs when I give different names to the columns, however this is not the case.

Community
  • 1
  • 1
HonzaB
  • 7,065
  • 6
  • 31
  • 42
  • Haven't thought a lot about it but what first comes to my mind is that maybe LOOCV calls predict.lm and it can't do predictions based on the dataset cafe.c because the variables names that u used in the lm were passed via cafe.c$some_variable. Try the same formula just writing some_variable instead of cafe.c$some_variable, maybe it works...! – Gere Caste Jan 09 '17 at 08:17

1 Answers1

1

Change your code like the following:

fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
               + Wraps.Waste + Muffins.Sold 
               + Muffins.Waste + Fruit.Cup.Sold 
               + Chips + Sodas + Coffees 
               + Day.of.Week,data=cafe.c)

LOOCV(fit.step, cafe.c)
# [1] 278.8984

idx <- 1
train <- cafe.c[-c(idx),]
test <- cafe.c[idx,] # need to select the row, not the column

(predict(fit.step, newdata = test) -cafe.c[idx,]$Sales)^2
#    1 
# 51.8022 

Also, you LOOCV implementation is not correct. You must fit a new model everytime on the train dataset on the leave-1-out fold. Right now you are training the model once on the entire dataset and using the same single model to compute the MSE on held out test dataset for each leave-1-out fold, but ideally you should have different models trained on different training datasets.

Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • Thank you! I also saw that it is not correct, I will try to fix it. Your proposed solution worked. – HonzaB Jan 09 '17 at 10:26