18

I have data which I regularly run regressions on. Each "chunk" of data gets fit a different regression. Each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem so I'm using the plyr package. I can easily create a list of lm() objects which works well. However I can't quite wrap my head around how I use those objects later to predict values in a separate data.frame.

Here's a totally contrived example illustrating what I'm trying to do:

# setting up some fake data
set.seed(1)
funct <- function(myState, myYear){
   rnorm(1, 100, 500) +  myState + (100 * myYear) 
}
state <- 50:60
year <- 10:40
myData <- expand.grid( year, state)
names(myData) <- c("year","state")
myData$value <- apply(myData, 1, function(x) funct(x[2], x[1]))
## ok, done with the fake data generation. 

require(plyr)

modelList <- dlply(myData, "state", function(x) lm(value ~ year, data=x))
## if you want to see the summaries of the lm() do this:  
    # lapply(modelList, summary)

state <- 50:60
year <- 50:60
newData <- expand.grid( year, state)
names(newData) <- c("year","state") 
## now how do I predict the values for newData$value 
   # using the regressions in modelList? 

So how do I use the lm() objects contained in modelList to predict values using the year and state independent values from newData?

Shreyas Karnik
  • 3,953
  • 5
  • 27
  • 26
JD Long
  • 59,675
  • 58
  • 202
  • 294

6 Answers6

9

Here's my attempt:

predNaughty <- ddply(newData, "state", transform,
  value=predict(modelList[[paste(piece$state[1])]], newdata=piece))
head(predNaughty)
#   year state    value
# 1   50    50 5176.326
# 2   51    50 5274.907
# 3   52    50 5373.487
# 4   53    50 5472.068
# 5   54    50 5570.649
# 6   55    50 5669.229
predDiggsApproved <- ddply(newData, "state", function(x)
  transform(x, value=predict(modelList[[paste(x$state[1])]], newdata=x)))
head(predDiggsApproved)
#   year state    value
# 1   50    50 5176.326
# 2   51    50 5274.907
# 3   52    50 5373.487
# 4   53    50 5472.068
# 5   54    50 5570.649
# 6   55    50 5669.229

JD Long edit

I was inspired enough to work out an adply() option:

pred3 <- adply(newData, 1,  function(x)
    predict(modelList[[paste(x$state)]], newdata=x))
head(pred3)
#   year state        1
# 1   50    50 5176.326
# 2   51    50 5274.907
# 3   52    50 5373.487
# 4   53    50 5472.068
# 5   54    50 5570.649
# 6   55    50 5669.229
JD Long
  • 59,675
  • 58
  • 202
  • 294
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
  • that totally nails it! Thank you mucho. Can you explain where the data.frame `piece` comes from? Is it autogenerated by ddply? – JD Long Dec 13 '11 at 23:21
  • @JDLong: `.fun` is ultimately called on a data frame named `piece`. But, as @BrianDiggs pointed out in chat, this shouldn't be relied upon. Better to wrap in an anonymous function (see my update). – Joshua Ulrich Dec 13 '11 at 23:30
  • hi, if you could have a look at my question it would be great http://stackoverflow.com/questions/43427392/apply-predict-between-data-frames-within-two-lists . thanks! – aaaaa Apr 15 '17 at 15:05
  • @JDLong Am I able to get standard errors using this approach? – juliamm2011 Oct 17 '19 at 18:02
  • @juliamm2011 I think all you have to do is turn `se.fit=TRUE` per this question: https://stackoverflow.com/a/33660779/37751 note that we're now 8 years after this question was answered and I would no longer be using `adply` any longer but rather would probably use `broom` – JD Long Oct 18 '19 at 02:51
8

You need to use mdply to supply both the model and the data to each function call:

dataList <- dlply(newData, "state")

preds <- mdply(cbind(mod = modelList, df = dataList), function(mod, df) {
  mutate(df, pred = predict(mod, newdata = df))
})
hadley
  • 102,019
  • 32
  • 183
  • 245
6

A solution with just base R. The format of the output is different, but all the values are right there.

models <- lapply(split(myData, myData$state), 'lm', formula = value ~ year)
pred4  <- mapply('predict', models, split(newData, newData$state))
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • thanks @ramnath. I really like comparing base R solutions to those done with packages. It helps me both improve my base R understanding as well as understand the compromises I am making when using abstractions like plyr. – JD Long Dec 14 '11 at 04:48
  • And this is how I normally solve the problem - but with `dlply` and `mdply` – hadley Dec 14 '11 at 12:30
  • @hadley Could you show a worked example for this case? I tried building one with `mdply` and could not figure out how to do it because `.data` has to be a matrix or data.frame, and the two arguments to `predict` are an `lm` object and a `data.frame`. I couldn't stuff a list of `lm` objects as a column in a `data.frame`. The other approach I tried, making `.data` a list of lists, (`.data=list(object=modelList, newData=newDataList)` where `newDataList <- dlply(newData, .(state), identity)`) didn't work because `.data` wasn't a matrix or data.frame (as per the documentation). – Brian Diggs Dec 14 '11 at 19:04
  • In brief, cbind the two lists together – hadley Dec 16 '11 at 03:12
4

What is wrong with

lapply(modelList, predict, newData)

?

EDIT:

Thanks for explaining what is wrong with that. How about:

newData <- data.frame(year)
ldply(modelList, function(model) {
  data.frame(newData, predict=predict(model, newData))
})

Iterate over the models, and apply the new data (which is the same for each state since you just did an expand.grid to create it).

EDIT 2:

If newData does not have the same values for year for every state as in the example, a more general approach can be used. Note that this uses the original definition of newData, not the one in the first edit.

ldply(state, function(s) {
  nd <- newData[newData$state==s,]
  data.frame(nd, predict=predict(modelList[[as.character(s)]], nd))
})

First 15 lines of this output:

   year state  predict
1    50    50 5176.326
2    51    50 5274.907
3    52    50 5373.487
4    53    50 5472.068
5    54    50 5570.649
6    55    50 5669.229
7    56    50 5767.810
8    57    50 5866.390
9    58    50 5964.971
10   59    50 6063.551
11   60    50 6162.132
12   50    51 5514.825
13   51    51 5626.160
14   52    51 5737.496
15   53    51 5848.832
Brian Diggs
  • 57,757
  • 13
  • 166
  • 188
  • that's exactly the sort of thing I keep cooking up, but it's not really what I'm after. That applies every model to every state. I only want the model where state==50 to be applied to the data where state==50 – JD Long Dec 13 '11 at 22:49
2

I take it the hard part is matching each state in newData to the corresponding model.

Something like this perhaps?

predList <- dlply(newData, "state", function(x) {
  predict(modelList[[as.character(min(x$state))]], x) 
})

Here I used a "hacky" way of extracting the corresponding state model: as.character(min(x$state))

...There is probably a better way?

Output:

> predList[1:2]
$`50`
       1        2        3        4        5        6        7        8        9       10       11 
5176.326 5274.907 5373.487 5472.068 5570.649 5669.229 5767.810 5866.390 5964.971 6063.551 6162.132 

$`51`
      12       13       14       15       16       17       18       19       20       21       22 
5514.825 5626.160 5737.496 5848.832 5960.167 6071.503 6182.838 6294.174 6405.510 6516.845 6628.181

Or, if you want a data.frame as output:

predData <- ddply(newData, "state", function(x) {
  y <-predict(modelList[[as.character(min(x$state))]], x)
  data.frame(id=names(y), value=c(y))
})

Output:

head(predData)
  state id    value
1    50  1 5176.326
2    50  2 5274.907
3    50  3 5373.487
4    50  4 5472.068
5    50  5 5570.649
6    50  6 5669.229
Tommy
  • 39,997
  • 12
  • 90
  • 85
1

Maybe I'm missing something, but I believe lmList is the ideal tool here,

library(nlme)
ll = lmList(value ~ year | state, data=myData)
predict(ll, newData)


## Or, to show that it produces the same results as the other proposed methods...
newData[["value"]] <- predict(ll, newData)
head(newData)
#   year state    value
# 1   50    50 5176.326
# 2   51    50 5274.907
# 3   52    50 5373.487
# 4   53    50 5472.068
# 5   54    50 5570.649
# 6   55    50 5669.229
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
baptiste
  • 75,767
  • 19
  • 198
  • 294