Dummy variables and in- and out-of-sample prediction?

Question

I am trying to predict sales for a retail store. Here are my variables (apart from ZipZone, the actual values are largely irrelevant to this question):

storeId    sales    meanTemperature     meanHumidity    ZipZone
1          1350     56.78               61.12           0
2          1230     59.90               45.67           3
3          8476     63.54               49.87           3
4          4357     62.12               65.09           4
5          2314     69.78               68.99           4
6          7812     74.90               59.78           4
7          1350     56.78               61.12           6
8          1230     59.90               45.67           6
9          8476     63.54               49.87           6
10         4357     62.12               65.09           7
11         2314     69.78               68.99           7
12         7812     74.90               59.78           8
...

There are 50 unique storeId values (i.e. there are fifty stores). I built a regression model of the form:

model <- lm(sales ~ meanTemperature*meanHumidity + ZipZone, data = inSample)

I'm currently testing how well this model predicts in and out of sample, so I've created inSample and outSample data frames (the former has 40 stores; the latter has 10). The issue, though, is that several ZipZones contain just one store. For example, the inSample table has store 1 (the only store in ZipZone 0), while the outSample table has store 12 (the only store in ZipZone 8). When I run the following:

pred <- predict(model, newdata = outSample)

I get the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor ZIPzone has new levels 8

I assume this is because inSample doesn't have a store in ZipZone 8, while outSample does. How can I avoid this problem?

This answer is probably helpful for this question: http://stackoverflow.com/questions/4285214/predict-lm-with-an-unknown-factor-level-in-test-data — user2728808, Apr 02 '16 at 17:13
I don't usually upvote questions that I also close, but in this case I think the question adds something because it included the error message text, which the earlier question did not. Sometimes it's useful to have the actual error text as a search target. — IRTFM, Apr 02 '16 at 17:24
It would be very helpful if we knew the data types of your columns. You can use the `dput` function to provide us with sample data; see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — andrechalom, Apr 02 '16 at 17:24
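
A minimal sketch of the situation and of one possible way around the error, under stated assumptions. The data below are synthetic (the real data were not posted), the split in `outIds` is made up so that ZipZone 8 appears only out of sample, and the helper names `known`, `unseen` and `canPredict` are hypothetical. The idea is to compare the out-of-sample levels against the levels the fitted model stored in `model$xlevels`, and to predict only for rows whose ZipZone the model has seen, leaving the rest as NA. Whether dropping those stores is acceptable depends on the analysis; this is not the only option.

```r
# Synthetic stand-in for the real data: 50 stores, ZipZone stored as a factor.
# Store 1 is the only store in zone 0; store 50 is the only store in zone 8.
set.seed(1)
stores <- data.frame(
  storeId         = 1:50,
  sales           = round(runif(50, 1000, 9000)),
  meanTemperature = runif(50, 55, 75),
  meanHumidity    = runif(50, 45, 70),
  ZipZone         = factor(c(0, rep(1:7, each = 7)[1:48], 8))
)

# Assumed split: 10 stores held out, including the lone store in zone 8.
outIds    <- c(3, 4, 10, 11, 17, 24, 31, 38, 45, 50)
inSample  <- stores[!stores$storeId %in% outIds, ]
outSample <- stores[stores$storeId %in% outIds, ]

model <- lm(sales ~ meanTemperature * meanHumidity + ZipZone, data = inSample)

# Levels the model was fitted with, and out-of-sample levels it has never seen.
known  <- model$xlevels$ZipZone
unseen <- setdiff(as.character(unique(outSample$ZipZone)), known)

# Predict only where the level is known; rows with unseen levels stay NA.
canPredict <- outSample$ZipZone %in% known
pred <- rep(NA_real_, nrow(outSample))
pred[canPredict] <- predict(model, newdata = outSample[canPredict, ])
```

With this split, `unseen` holds the single level that triggers the error, and `pred` has an NA for the store in that zone. An alternative is to make the split so that every ZipZone in `outSample` also occurs in `inSample`.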
