I'm currently working on a university project to predict the number of customers arriving at a 24/7 shop. I'm using data from a shop that contains (among other things) the date and time at which every customer in a given year was served.
I've split this data set into a training set and a cross-validation set. Furthermore, I've aggregated the training set and merged it with weather data from the same year to find out whether, for example, high temperatures lead to more customers.
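Roughly, the split-and-merge step looks like this (the data frames, column names, and 80/20 split ratio are made up for illustration, with toy stand-ins for the real data):

```r
set.seed(42)  # reproducible toy data and split

# Toy stand-ins: one row per hour of served-customer counts,
# plus matching hourly weather observations
visits <- data.frame(
  Month = 12, Day = 31, Hour = 0:23,
  ServedCustomers = rpois(24, lambda = 2)
)
weather <- data.frame(
  Month = 12, Day = 31, Hour = 0:23,
  Temperature = round(rnorm(24, mean = 9, sd = 1), 1),
  Rain = rbinom(24, size = 1, prob = 0.3)
)

# Split into training and cross-validation sets
train_idx <- sample(nrow(visits), size = floor(0.8 * nrow(visits)))

# merge() joins on the shared key columns (here: Month, Day, Hour)
train <- merge(visits[train_idx, ],  weather, by = c("Month", "Day", "Hour"))
cv    <- merge(visits[-train_idx, ], weather, by = c("Month", "Day", "Hour"))
```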
A simplified version of the merged data looks something like this:
| ServedCustomers | Month | Day | Hour | Temperature (°C) | Rain (binary) |
| --------------- | ----- | --- | ---- | ---------------- | ------------- |
| 1               | 12    | 31  | 12   | 9.2              | 0             |
| 0               | 12    | 31  | 13   | 9.8              | 1             |
| 2               | 12    | 31  | 14   | 10.1             | 0             |
For every hour of the year, I have the number of customers that were served as well as the corresponding weather data.
I've created a multiple linear regression model in R to predict the number of customers, with pretty much every other variable as a predictor. Judging by the output of the summary()
command, the MSE, R², and other statistics look okay so far.
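The fit itself is essentially this (column names are hypothetical, and the toy data here just stands in for my merged training set):

```r
# Hypothetical merged training data; in the real project this is the
# aggregated, weather-merged training set described above
set.seed(1)
train <- data.frame(
  ServedCustomers = rpois(100, lambda = 2),
  Month = sample(1:12, 100, replace = TRUE),
  Day = sample(1:28, 100, replace = TRUE),
  Hour = sample(0:23, 100, replace = TRUE),
  Temperature = rnorm(100, mean = 10, sd = 5),
  Rain = rbinom(100, size = 1, prob = 0.3)
)

# Multiple linear regression with every other column as a predictor
fit <- lm(ServedCustomers ~ ., data = train)
summary(fit)  # coefficients, residual standard error, R^2, F-statistic
```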
Now I want to check whether the same model also works on the cross-validation set. For that, I've merged it with the same weather data to obtain a data set that has the same structure as the table above, only with different customer counts.
However, that's where I'm currently stuck. Calling the predict.lm()
function with the model and the cross-validation set does seem to work, but it yields only the predicted values and little additional information.
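To illustrate the sticking point (again with hypothetical names and toy data standing in for my real model and cross-validation set):

```r
# Toy stand-ins for my fitted model and cross-validation set
set.seed(2)
make_data <- function(n) data.frame(
  ServedCustomers = rpois(n, lambda = 2),
  Hour = sample(0:23, n, replace = TRUE),
  Temperature = rnorm(n, mean = 10, sd = 5),
  Rain = rbinom(n, size = 1, prob = 0.3)
)
train <- make_data(200)
cv    <- make_data(50)
fit   <- lm(ServedCustomers ~ Hour + Temperature + Rain, data = train)

# This is where I'm stuck: predict() returns just a numeric vector
preds <- predict(fit, newdata = cv)

# It can also add prediction intervals, but still no overall
# fit statistics for the new data set:
pred_int <- predict(fit, newdata = cv, interval = "prediction")
```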
Is there some way to create a summary of how well the model performs on the other data set? Something like the summary()
command, but for a data set that the linear model wasn't originally fitted on?