1

I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.

Define the model:

Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}

Execute the model on subsets:

Modrpt <- ddply(msadata, "Dmsa", Model)

Print Modrpt gives the model coefficients, but no Y estimates.

I know that if I wasn't using ddply, I can access the glm estimated Y values by using the code:

Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)

fits <- Model$fitted.values

I have tried both of the following to get the fitted values for the subsets, but no luck:

fits <- fitted.values(ddply(msadata, "Dmsa", Model))

fits <- ddply(msadata, "Dmsa", fitted.values(Model))

I'm sure this is a very easy to code...unfortunately, I'm just learning R. Does anyone know where I am going wrong?

Carter
  • 179
  • 3
  • 8

2 Answers2

3

You can use an anonymous function in your call to ddply e.g.

require(plyr)
data(iris)
model <- function(df){
    lm( Petal.Length ~ Sepal.Length + Sepal.Width , data = df )
    }

ddply( iris , "Species" , function(x) fitted.values( model(x) ) ) 

This has the advantage that you can also, without rewriting your model function, get thecoef values by doing

    ddply( iris , "Species" , function(x) coef( model(x) ) ) 

As @James points out, this will fall down if you have splits of unequal size, better to use dlply which puts the result of each subset in it's own list element.

(I make no claims for statistical relevance or correctness of the example model - it is just an example)

Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • This only works if the splits are of equal size. To use this more generally, use `dlply` instead of `ddply`. – James Aug 06 '13 at 14:11
  • Simon, I think this is nearly what I need. Using this code, I am able to get the `coef` version to work (your bottom line of code), but when I use the `fitted.values` code I get the following error: `Error in list_to_dataframe(res, attr(.data, "split_labels")) : Results do not have equal lengths`. Any idea what may be causing this? – Carter Aug 06 '13 at 14:20
  • Yes, read James comment above and use `dlply` instead of `ddply`, you have unequal splits in your data (the data.frames that are returned do not have same number of columns) – Simon O'Hanlon Aug 06 '13 at 14:22
  • 1
    Yes, got it to work! Thank you to James and Simon. To clarify then, if I have 5 subsets, I should get 5 output lists using `dlply`, correct? How then would I pull these lists together as vectors where each list is a row vector (i.e., resulting in 5 rows with a column for each observation)? – Carter Aug 06 '13 at 14:34
  • 1
    Okay, got these lists pulled together as vectors using cbind. Thanks again for your help. – Carter Aug 06 '13 at 14:55
  • One more question if you don't mind...since it builds on this. Ultimately, I'm trying ensure the sum of the estimated Y values for each subset is equal to the sum of the observed Y values. `cbind` has allowed me to get these lists as row vectors (with two columns), but the Y estimates for each subset are all contained in a single cell for row as `c(1435.234, 8778.1389, ....)`. Is there a way to get the sum of these estimates within the `cbind` function? I know I can use `total = sum(model$outlistname)` to get the sum for an individual list, but now sure how to automate that within `cbind`. – Carter Aug 06 '13 at 15:15
  • Why would the sum of the estimated be equal to the observed? Do you mean the length? I don't think I understand what you mean. Perhaps a better way is to ask another question? Illustrating with data. – Simon O'Hanlon Aug 06 '13 at 15:22
0

I'd recommending doing this in two steps:

library(plyr)

# First first the models
models <- dlply(iris, "Species", lm, 
  formula = Petal.Length ~ Sepal.Length + Sepal.Width )

# Next, extract the fitted values
ldply(models, fitted.values)

# Or maybe
ldply(models, as.data.frame(fitted.values))
hadley
  • 102,019
  • 32
  • 183
  • 245