I would like to build separate models for the different segments of my data. I have built the models like so:

log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)

If I run the predictions on the training data, R remembers the subsets, and each prediction vector has the length of its respective subset.

However, if I run the predictions on the testing data, each prediction vector has the length of the whole dataset, not that of the subset.

My question is whether there is a simpler way to achieve this than first subsetting the testing data, running the predictions on each subset, concatenating the predictions, rbinding the subsets, and appending the concatenated predictions, like this:

T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- allpred  # assign the vector directly; as.data.frame() would nest a data frame inside the column

I'd like to think I am being stupid and there is an easier way to accomplish this, that is, fitting many models on small subsets of the data. How can I combine them to get predictions on the full testing data?

Machavity
DGenchev

1 Answer

First, here's some sample data:

set.seed(15)
# spell out TRUE: T is an ordinary variable that can be reassigned
train <- data.frame(x1=sample(0:1, 100, replace=TRUE),
  x2=rpois(100,10),
  y=sample(0:1, 100, replace=TRUE))
test <- data.frame(x1=sample(0:1, 10, replace=TRUE),
  x2=rpois(10,10))

Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from the model, since it is constant within each subset:

fits<-list(
  glm(y ~ .-x1, family = "binomial", data = train, subset = x1==0),
  glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2<10),
  glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2>=10)
)

Now, for the test data, I create an indicator that specifies which group each observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.

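whichsubset <- as.vector(sapply(fits, function(x) {
    # the unevaluated subset= expression stored in the model's call
    subsetparam <- x$call$subset
    # evaluate it against the test data: one logical column per model
    eval(subsetparam, test)
}) %*% matrix(seq_along(fits), ncol=1))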
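To make the two moving parts of that step more concrete, here is a minimal sketch on toy data. The names `d`, `m`, `new_rows`, `membership`, and `group` are invented for this illustration and are not part of the answer's code. It shows that a fitted glm keeps its `subset=` argument as an unevaluated expression, that `eval()` can re-apply that expression to another data frame, and that multiplying a logical membership matrix by a column of 1, 2, ... yields a group number per row.

```r
# Toy data, assumed only for this illustration.
set.seed(1)
d <- data.frame(x1 = rep(0:1, each = 20), x2 = rnorm(40), y = rep(0:1, 20))
m <- glm(y ~ x2, family = "binomial", data = d, subset = x1 == 0)

# glm() stores the call that created the model, with subset= left unevaluated.
cond <- m$call$subset
print(cond)

# eval() with a data frame as its second argument looks up x1 in that data
# frame, so the same condition can be applied to rows the model never saw.
new_rows <- data.frame(x1 = c(1, 0, 0), x2 = c(0.5, -1, 2))
in_subset <- eval(cond, new_rows)
print(in_subset)

# With several conditions, binding the logical results as columns and
# multiplying by 1, 2, ... converts "which column is TRUE" into a group number.
membership <- cbind(new_rows$x1 == 0, new_rows$x1 == 1)
group <- as.vector(membership %*% matrix(1:2, ncol = 1))
print(group)
```

A row that satisfies none of the conditions would get group 0 from this product, and a row that satisfies two would get the sum of their indices, which is why the conditions need to be exhaustive and mutually exclusive.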

You'll want to make sure your groups are mutually exclusive, because this code does not check. Then you can use that indicator as the grouping factor in a split/unsplit strategy to make your predictions:

unsplit(
    Map(function(a,b) predict(a,b), 
        fits, split(test, whichsubset)
    ), 
    whichsubset
 )
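As a rough illustration of how the split/Map/unsplit pattern fits together, here is a small sketch with stand-ins. `dat` plays the role of the test set, `scorers` the role of the fitted models, and `grp` the role of `whichsubset`; all three names are made up for this example.

```r
# Toy stand-ins, assumed only for this illustration.
dat <- data.frame(id = 1:6, v = c(10, 20, 30, 40, 50, 60))
grp <- c(2, 1, 2, 3, 1, 3)

# split() cuts the data frame into one piece per group, ordered by group value.
pieces <- split(dat, grp)
print(names(pieces))

# One function per group, listed in the same order as the pieces.
scorers <- list(
  function(d) d$v + 1,
  function(d) d$v + 2,
  function(d) d$v + 3
)

# Map() walks both lists in parallel: scorer 1 gets piece 1, and so on.
per_group <- Map(function(f, d) f(d), scorers, pieces)

# unsplit() puts the per-group results back into the original row order.
out <- unsplit(per_group, grp)
print(out)
```

Because the pairing is purely positional, it relies on every group appearing at least once in the data being scored. If a group has no rows, `split()` returns fewer pieces than there are models and `Map()` would pair them up incorrectly. Note also that `predict()` on a glm returns values on the link scale by default; passing `type = "response"`, as in the question, gives probabilities.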

An even easier strategy would have been to create the segregating factor in the first place. This would make the model fitting easier as well.
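One possible shape for that factor-first approach is sketched below. The helper `segment_of()`, the column `seg`, and the labels "a", "b", "c" are hypothetical names chosen for this example. The formula is written out as `y ~ x2` because `y ~ .` would otherwise pick up the new `seg` column.

```r
# Hypothetical helper: labels each row with its segment, using the same
# three conditions as the models in the question.
segment_of <- function(d) {
  with(d, ifelse(x1 == 0, "a", ifelse(x2 < 10, "b", "c")))
}

set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))

train$seg <- segment_of(train)
test$seg  <- segment_of(test)

# One model per segment, stored under the segment's label.
fits <- lapply(split(train, train$seg),
               function(d) glm(y ~ x2, family = "binomial", data = d))

# Score each test chunk with the model of the same label, then restore
# the original row order with unsplit().
chunks <- split(test, test$seg)
test$pred <- unsplit(
  Map(function(seg, d) {
        unname(predict(fits[[seg]], newdata = d, type = "response"))
      },
      names(chunks), chunks),
  test$seg
)
print(test)
```

Looking models up by label rather than by position means a segment that happens to be absent from the test data does not shift the pairing. A segment present in the test data but absent from the training data would still fail, since no model exists for it.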

MrFlick
  • Thank you ever so much for the quick reply! This is great and it works. But if you have the time, can you elaborate on the last two blocks of code? How exactly do they work? For example, what does the `x$call$subset` and `Map(function(a,b) predict(a,b), fits, split(test, whichsubset))` do? – DGenchev Jul 23 '15 at 20:56