
So I have a small data set which should be great for modeling (<1 million records), but one variable is giving me problems. It's a categorical variable with ~98 levels called [store] - this is the name of each store. I am trying to predict each store's sales [sales], which is a continuous numeric variable. The resulting vector is over 10GB, and R crashes with memory errors. Is it possible to make 98 different regression equations and run them one by one, one for every level of [store]?

My other idea would be to create 10 or 15 clusters of this [store] variable and then use the cluster names as my categorical variable when predicting the continuous [sales] variable.

barker

1 Answer


Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:

data(iris)
# Fit one model per species: split() partitions the rows, lapply() fits each piece
models <- lapply(split(iris, iris$Species),
                 function(df) lm(Sepal.Width ~ Sepal.Length, data=df))

The result is a list of the species-specific regression models.
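
If it's useful, you can inspect the fitted models directly; for instance, pulling the coefficients out of the list (this uses nothing beyond the models object built above):

# One column of (intercept, slope) estimates per species
sapply(models, coef)

# Or examine a single species' fit in detail
summary(models[["setosa"]])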

To predict, I think it would be most efficient to first split your test set, then call predict with the corresponding model on each subset, and finally recombine:

test.iris <- iris                                  # stand-in for a real test set
test.spl <- split(test.iris, test.iris$Species)
predictions <- unlist(lapply(test.spl, function(df) {
  # Look the model up by name; as.character() avoids relying on factor
  # level order happening to match the order of the list
  predict(models[[as.character(df$Species[1])]], newdata=df)
}))
test.ordered <- do.call(rbind, test.spl)  # Test obs. in same order as predictions
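
As a quick sanity check (again using only the objects defined above), you can line the recombined observations up against the predictions:

# Rows of test.ordered match the elements of predictions one-to-one
head(data.frame(observed = test.ordered$Sepal.Width, predicted = predictions))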

Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function.
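
Here is a minimal sketch of that cluster-then-split idea, assuming a data frame sales.data with columns store and sales (placeholder names standing in for your actual data):

# Hypothetical: sales.data with columns 'store' and 'sales' stands in for your data
store.means <- tapply(sales.data$sales, sales.data$store, mean)  # avg sales per store
set.seed(1)                              # kmeans starts randomly; fix the seed
km <- kmeans(store.means, centers = 10)  # group the ~98 stores into 10 clusters
# Map each row's store back to its cluster, then split on the cluster
sales.data$cluster <- factor(km$cluster[as.character(sales.data$store)])
by.cluster <- split(sales.data, sales.data$cluster)
# by.cluster can now be passed to lapply exactly as in the iris example above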

josliber
  • Thanks, I think this is exactly what I'm looking for. Is the predict() function straightforward, or is there a tricky way to combine all of these different models? (I will have ~98 of them) – barker Feb 25 '14 at 17:17
  • @user2228155 I updated the post to include how to invoke the `predict` function. – josliber Feb 25 '14 at 17:28
  • Thanks, do you think this will help with memory management? The only reason I'm taking this approach is that I don't have enough memory to handle all the data. Will R calculate each lm one by one, reducing the total memory used at any point in time? – barker Feb 25 '14 at 20:32
  • You'll be running regression models on much smaller datasets, so that step will be less memory intensive. However, when you use `split` to break up your dataset into subsets, you've essentially stored your whole dataset a second time. I would suggest running the code and seeing if it helps with your memory concerns (a more memory-conscious variant is sketched after these comments). – josliber Feb 25 '14 at 20:39
  • Tested it and got this: `models <- lapply(split(deltrain, deltrain$Storechar), function(df) lm(Weekly_Sales ~ CPI + Unemployment + Size + Type + Temperature + Fuel_Price, data=df))` fails with: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels – barker Feb 28 '14 at 21:26
  • In one of the subsets of the data, a factor variable is now taking on only a single value. Since the topic of this question is "how do I perform cluster-then-predict on my dataset," I think you should post a reproducible example with this new issue as a separate question (a sketch of one common workaround appears after these comments). – josliber Feb 28 '14 at 21:35
  • I was trying to perform my original question, making 98 different regression equations. – barker Mar 01 '14 at 01:55
  • @user2228155 the code I posted is clearly working code on the `iris` dataset (you can copy the code I provided into your R terminal and run it successfully). It sounds like it doesn't work on your dataset. We're not going to make progress unless you provide a minimal reproducible example. Here's a good place to start: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – josliber Mar 01 '14 at 02:02
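
For what it's worth, here is one hedged sketch combining the two fixes discussed in the comments above: fitting the per-store models in a plain loop, so only one subset is materialized at a time, and dropping any factor that collapses to a single level within a store, which is what triggers the contrasts error. The deltrain column names are taken from the error message in the comments and may not match the full dataset:

# Hypothetical sketch; 'deltrain' and its columns come from the comment above
models <- list()
for (s in unique(as.character(deltrain$Storechar))) {
  # Take one store's rows and drop now-unused factor levels
  df <- droplevels(deltrain[deltrain$Storechar == s, ])
  # Keep only columns that still vary; a single-level factor is what causes
  # "contrasts can be applied only to factors with 2 or more levels"
  varies <- sapply(df, function(x) !is.factor(x) || nlevels(x) >= 2)
  df <- df[, varies, drop = FALSE]
  preds <- setdiff(names(df), c("Weekly_Sales", "Storechar"))
  # model = FALSE avoids storing a copy of the data inside each fitted object
  models[[s]] <- lm(reformulate(preds, response = "Weekly_Sales"),
                    data = df, model = FALSE)
}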