I am trying to predict sales for a retail store. Here are my variables (you can largely ignore the values of the variables, outside of ZipZone; their values are largely irrelevant for this question):
storeId sales meanTemperature meanHumidity ZipZone
1 1350 56.78 61.12 0
2 1230 59.90 45.67 3
3 8476 63.54 49.87 3
4 4357 62.12 65.09 4
5 2314 69.78 68.99 4
6 7812 74.90 59.78 4
7 1350 56.78 61.12 6
8 1230 59.90 45.67 6
9 8476 63.54 49.87 6
10 4357 62.12 65.09 7
11 2314 69.78 68.99 7
12 7812 74.90 59.78 8
...
There are 50 unique storeId values (i.e. there are fifty stores). I built a regression model in the form of:
model <- lm(sales ~ meanTemperature*meanHumidity + ZipZone)
I'm currently testing this model's efficacy in terms of in- and out-of-sample prediction, so I've created inSample and outSample data frames (the former has 40 stores; the latter has 10). The issue, though, is that several stores are the only store in their ZipZone. For example, the inSample table has store 1 (the only store in ZipZone 0), while the outSample table has store 12 (the only store in ZipZone 8). When I run the following:
pred <- predict(model, newdata = outSample)
I get the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor ZIPzone has new levels 8
I assume this is because inSample doesn't have a store in ZipZone 8, while outSample does. How can I avoid this problem?
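One way to check that assumption would be to compare the ZipZone values in the two tables directly. The sketch below is only illustrative: the two small data frames are made-up stand-ins for the real inSample and outSample tables, ZipZone is assumed to be stored as a factor, and the names seenLevels, newLevels and unseenRows are hypothetical helpers.

```r
# Made-up toy data standing in for the real inSample / outSample tables
inSample <- data.frame(
  storeId         = 1:6,
  sales           = c(1350, 1230, 8476, 4357, 2314, 7812),
  meanTemperature = c(56.78, 59.90, 63.54, 62.12, 69.78, 74.90),
  meanHumidity    = c(61.12, 45.67, 49.87, 65.09, 68.99, 59.78),
  ZipZone         = factor(c(0, 3, 3, 4, 4, 4))
)
outSample <- data.frame(
  storeId         = c(10, 11, 12),
  sales           = c(4357, 2314, 7812),
  meanTemperature = c(62.12, 69.78, 74.90),
  meanHumidity    = c(65.09, 68.99, 59.78),
  ZipZone         = factor(c(3, 4, 8))
)

# ZipZone values the model would see during fitting
seenLevels <- unique(as.character(inSample$ZipZone))

# ZipZone values that appear only in outSample
newLevels <- setdiff(unique(as.character(outSample$ZipZone)), seenLevels)
print(newLevels)

# Stores in outSample whose ZipZone never appears in inSample
unseenRows <- outSample[!(as.character(outSample$ZipZone) %in% seenLevels), ]
print(unseenRows$storeId)
```

If newLevels comes back non-empty on the real data, the listed ZipZone values are the ones predict() cannot handle, and unseenRows shows which stores are affected.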