I'm currently working on a university project to predict the number of customers arriving at a 24/7 shop. I'm using data from a shop that contains (among other things) the date and time at which every customer in a given year was served.
I've split this data set into a training set and a cross-validation set. Furthermore, I've aggregated the training set and merged it with weather data from the same year to find out whether, for example, high temperatures lead to more customers.
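Roughly, the split-and-merge step looks like this (the data frames, column names, and 80/20 split ratio are made up for illustration, with toy stand-ins for the real data):

```r
set.seed(42)  # reproducible toy data and split

# Toy stand-ins: one row per hour of served-customer counts,
# plus matching hourly weather observations
visits <- data.frame(
  Month = 12, Day = 31, Hour = 0:23,
  ServedCustomers = rpois(24, lambda = 2)
)
weather <- data.frame(
  Month = 12, Day = 31, Hour = 0:23,
  Temperature = round(rnorm(24, mean = 9, sd = 1), 1),
  Rain = rbinom(24, size = 1, prob = 0.3)
)

# Split into training and cross-validation sets
train_idx <- sample(nrow(visits), size = floor(0.8 * nrow(visits)))

# merge() joins on the shared key columns (here: Month, Day, Hour)
train <- merge(visits[train_idx, ],  weather, by = c("Month", "Day", "Hour"))
cv    <- merge(visits[-train_idx, ], weather, by = c("Month", "Day", "Hour"))
```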
A simplified version of the merged data looks something like this:
| ServedCustomers | Month | Day | Hour | Temperature (°C) | Rain (binary) |
| --------------- | ----- | --- | ---- | ---------------- | ------------- |
| 1               | 12    | 31  | 12   | 9.2              | 0             |
| 0               | 12    | 31  | 13   | 9.8              | 1             |
| 2               | 12    | 31  | 14   | 10.1             | 0             |
For every hour of the year, I have the number of customers that were served as well as the corresponding weather data.
I've created a multiple linear regression model in R to predict the number of customers, with pretty much every other variable as a predictor. Judging by the output of the summary()
command, the MSE, R², and other statistics look okay so far.
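The fit itself is essentially this (column names are hypothetical, and the toy data here just stands in for my merged training set):

```r
# Hypothetical merged training data; in the real project this is the
# aggregated, weather-merged training set described above
set.seed(1)
train <- data.frame(
  ServedCustomers = rpois(100, lambda = 2),
  Month = sample(1:12, 100, replace = TRUE),
  Day = sample(1:28, 100, replace = TRUE),
  Hour = sample(0:23, 100, replace = TRUE),
  Temperature = rnorm(100, mean = 10, sd = 5),
  Rain = rbinom(100, size = 1, prob = 0.3)
)

# Multiple linear regression with every other column as a predictor
fit <- lm(ServedCustomers ~ ., data = train)
summary(fit)  # coefficients, residual standard error, R^2, F-statistic
```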
Now I want to check whether the same model also works on the cross-validation set. For that, I've merged it with the same weather data to obtain a data set that has the same structure as the table above, only with different customer counts.
However, that's where I'm currently stuck. Calling the predict.lm()
function with the model and the cross-validation set does seem to work, but it yields only the predicted values and little additional information.
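To illustrate the sticking point (again with hypothetical names and toy data standing in for my real model and cross-validation set):

```r
# Toy stand-ins for my fitted model and cross-validation set
set.seed(2)
make_data <- function(n) data.frame(
  ServedCustomers = rpois(n, lambda = 2),
  Hour = sample(0:23, n, replace = TRUE),
  Temperature = rnorm(n, mean = 10, sd = 5),
  Rain = rbinom(n, size = 1, prob = 0.3)
)
train <- make_data(200)
cv    <- make_data(50)
fit   <- lm(ServedCustomers ~ Hour + Temperature + Rain, data = train)

# This is where I'm stuck: predict() returns just a numeric vector
preds <- predict(fit, newdata = cv)

# It can also add prediction intervals, but still no overall
# fit statistics for the new data set:
pred_int <- predict(fit, newdata = cv, interval = "prediction")
```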
Is there some way to create a summary of how well the model performs on the other data set? Something like the summary()
command, but for a data set that the linear model wasn't originally fitted on?