
I need to store an lm fit object in a data frame for further processing (I will have 200+ regressions to store this way). I am not able to store the fit object in the data frame. The following code produces an error:

x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)

df = data.frame()
df = rbind(df, c(id="xx1", fitObj=fit))

Error in rbind(deparse.level, ...) : 
  invalid list argument: all variables should have the same length

I would like to get a data frame like the one returned by dplyr's "do" call, for example:

> tacrSECOutput
Source: local data frame [24 x 5]
Groups: <by row>

                            sector control     id1     fit count
1  Chemicals and Chemical Products       S tSector <S3:lm>  2515
2     Construation and Real Estate       S tSector <S3:lm>   985

Please note that this is sample output only. I would like to create the data frame (with a fit column holding the lm objects) in the above format so that the rest of my code can work on the added models.

What am I doing wrong? Appreciate the help very much.

kishore
  • You should use a `list` rather than a `data.frame`. –  Nov 10 '15 at 09:54
  • @Pascal, how would the code look as I will have a number of regressions to store and then access them further? – kishore Nov 10 '15 at 09:56
  • You may look into lapply... – Heroka Nov 10 '15 at 09:57
  • The object returned by `lm()` is not a data frame. Read the documentation of `lm()`, section **Value**. – jogo Nov 10 '15 at 10:00
  • Not really a duplicate, as the OP wants to store the `fit` object, not only the coefficients. –  Nov 10 '15 at 10:01
  • Check out the [broom](https://github.com/dgrtwo/broom) package, which is made for converting fits into data frames. – lukeA Nov 10 '15 at 10:11
  • Check `broom` package. Nothing beats that.... Links: https://cran.r-project.org/web/packages/broom/vignettes/broom.html , http://varianceexplained.org/r/broom-intro/ – AntoniosK Nov 10 '15 at 10:11
  • As @Pascal pointed out, I need to store the fit object, not components of it. I looked into all of the above suggestions but couldn't find a way to store the fit object itself. I need to simulate what "do" of dplyr returns. I can't use "do" as my data is scattered across various places. – kishore Nov 10 '15 at 10:17
  • I have seen it done, looking it up now. – Mike Wise Nov 10 '15 at 10:31
  • This might help: http://stackoverflow.com/questions/5599896/how-do-i-store-arrays-of-statistical-models –  Nov 10 '15 at 10:41

2 Answers


The list approach:

Clearly based on @Pascal's idea. I'm not a fan of lists, but in some cases they are extremely helpful.

set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)

set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)


# manually select model names
model_names = c("fit1","fit2")

# create a list based on models names provided
list_models = lapply(model_names, get)

# set names
names(list_models) = model_names

# check the output
list_models

# $fit1
# 
# Call:
#   lm(formula = y ~ x)
# 
# Coefficients:
#   (Intercept)            x  
#        0.5368       1.9678  
# 
# 
# $fit2
# 
# Call:
#   lm(formula = y ~ x)
# 
# Coefficients:
#   (Intercept)            x  
#        0.5545       1.9192 

If you have lots of models in your workspace, the only "manual" step is to provide a vector of your model names (the names they are stored under). The get function then retrieves the actual model object for each name, and the results are saved in a list.
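If the models follow a naming pattern, the name vector itself can be generated instead of typed. The following is a minimal sketch, assuming the models sit in the current environment and are all named `fit` followed by digits (that pattern is an assumption made for this example). `ls()`, `mget()` and `Filter()` are base R.

```r
set.seed(42)
x <- runif(100)
y <- 2 * x + runif(100)
fit1 <- lm(y ~ x)

set.seed(123)
x <- runif(100)
y <- 2 * x + runif(100)
fit2 <- lm(y ~ x)

# find every object whose name looks like fit1, fit2, ...
# (assumed naming pattern, adjust the regular expression to your own names)
model_names <- ls(pattern = "^fit[0-9]+$")

# mget() returns a named list, so no separate names() step is needed
list_models <- mget(model_names)

# keep only the objects that really are lm fits
list_models <- Filter(function(m) inherits(m, "lm"), list_models)

names(list_models)
```

Note that `ls()` returns names in alphabetical order, so `fit10` would sort before `fit2`; sort the names yourself if the order matters.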


Store model objects in a dataset when you create them:

The data frame can be created using dplyr and do if you are planning to store the model objects when they are created.

library(dplyr)

set.seed(42)
x1 = runif(100)
y1 = 2*x1+runif(100)

set.seed(123)
x2 <- runif(100)
y2 <- 2*x2+runif(100)


model_formulas = c("y1~x1", "y2~x2")

data.frame(model_formulas, stringsAsFactors = F) %>%
  group_by(model_formulas) %>%
  do(model = lm(.$model_formulas))

#     model_formulas   model
#              (chr)   (chr)
#   1          y1~x1 <S3:lm>
#   2          y2~x2 <S3:lm>

It REALLY depends on how "organised" the process is that lets you build those 200+ models you mentioned. You can build your models this way if they depend on columns of a specific dataset. It will not work if you want to build models based on various columns of different datasets, possibly from different workspaces, or with different model types (linear/logistic regression).
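For the scattered-data case, one option outside of `do` is to keep a list of model specifications and fit over it with `lapply`. The sketch below is only an illustration under assumptions: the data frames `d1` and `d2`, their column names, and the `specs` list are all made up for the example.

```r
# two separate (hypothetical) datasets
set.seed(42)
d1 <- data.frame(x1 = runif(100))
d1$y1 <- 2 * d1$x1 + runif(100)

set.seed(123)
d2 <- data.frame(x2 = runif(100))
d2$y2 <- 2 * d2$x2 + runif(100)

# one entry per model: a formula plus the dataset it should be fitted on
specs <- list(
  m1 = list(formula = y1 ~ x1, data = d1),
  m2 = list(formula = y2 ~ x2, data = d2)
)

# fit each specification; the result is a named list of lm objects
models <- lapply(specs, function(s) lm(s$formula, data = s$data))

# pull the slope out of every fit
slopes <- sapply(models, function(m) coef(m)[[2]])
slopes
```

A different model type could be handled the same way by adding a fitting function (for example `glm`) to each specification, at the cost of a slightly more involved `lapply` body.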


Store existing model objects in a dataset:

You can still use dplyr with the same idea as in the list approach. If the models are already built, you can look them up by name like this:

library(dplyr)

set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)

set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)


# manually select model names
model_names = c("fit1","fit2")

data.frame(model_names, stringsAsFactors = F) %>%
  group_by(model_names) %>%
  do(model = get(.$model_names))


#   model_names   model
#         (chr)   (chr)
# 1        fit1 <S3:lm>
# 2        fit2 <S3:lm>
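The same one-row-per-model layout can also be built without dplyr, because a data frame column may itself be a list. This is a minimal base R sketch, not a replacement for the `do` version; the name `models_df` is made up for the example.

```r
set.seed(42)
x <- runif(100)
y <- 2 * x + runif(100)
fit1 <- lm(y ~ x)

set.seed(123)
x <- runif(100)
y <- 2 * x + runif(100)
fit2 <- lm(y ~ x)

model_names <- c("fit1", "fit2")

# start with the ordinary columns ...
models_df <- data.frame(model_names, stringsAsFactors = FALSE)

# ... then attach the fits as a list column, one lm object per row
# (mget() looks the names up in the current environment)
models_df$model <- unname(mget(model_names))

# each cell is still a complete lm object
class(models_df$model[[1]])
coef(models_df$model[[2]])
```

Assigning the list with `$<-` after the data frame exists matters here: passing a list straight to `data.frame()` would try to spread its elements across columns.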
AntoniosK
  • I need to get a data frame with the lm object as an object. Please see the added explanations in the question. Thanks for the approach. – kishore Nov 10 '15 at 11:14
  • I'm afraid I can help you only if you are planning to create/build the models and store them in a data frame. Don't know how you can get ALREADY built models and store in a data frame. – AntoniosK Nov 10 '15 at 11:25
  • Thanks for the code, it gave me an idea about how to go about doing it. Please look at the comment on @MikeWise's answer, as his approach also helped me to get a solution. – kishore Nov 10 '15 at 13:50
  • I've updated my answer so you can use `dplyr` even if the models are pre-built. Let me know if you want me to delete the part of my answer that uses lists. – AntoniosK Nov 10 '15 at 14:02
  • @AntoniosK Can you add an example for how to extract models from a list and place them into a data frame? That way you could use broom on the model data frame even if the models weren't generated using dplyr. – wdkrnls Sep 23 '16 at 19:42
  • @AntoniosK: I thought I could take same approach you took, just treating the model list as the environment. However, when applying broom to that I get only the first model. – wdkrnls Sep 23 '16 at 20:40
  • Yes, broom is independent of dplyr package. Can you create a new question based on what you said? I have to reproduce your problem so I can solve it @wdkrnls . – AntoniosK Sep 23 '16 at 21:42

This seems to work:

x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)

df <- data.frame()
fitvec <- serialize(fit,NULL)
df <- rbind(df, data.frame(id="xx1", fitObj=fitvec))

fit1 <- unserialize( df$fitObj )
print(fit1)

yields:

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      0.529        1.936  
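One caveat with this version: `serialize()` returns a raw vector, and `data.frame()` gives every byte its own row. The sketch below checks that, and shows that wrapping the raw vector in a list keeps it to a single row. The names `wrapped` and `fit_back` are made up for the example.

```r
set.seed(1)
x <- runif(100)
y <- 2 * x + runif(100)
fit <- lm(y ~ x)

fitvec <- serialize(fit, NULL)
class(fitvec)                  # "raw"

# one row per byte: id is recycled along the raw vector
df <- data.frame(id = "xx1", fitObj = fitvec)
nrow(df) == length(fitvec)     # TRUE

# wrapping the raw vector in a list stores it in a single cell instead
wrapped <- data.frame(id = "xx1")
wrapped$fitObj <- list(fitvec)
nrow(wrapped)                  # 1

# the model survives the round trip
fit_back <- unserialize(wrapped$fitObj[[1]])
all.equal(coef(fit_back), coef(fit))
```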

Update: Okay, now a more complex version, so as to get one row per fit.

vdf <- data.frame()
fitlist <- list()
niter <- 5

for (i in 1:niter){
  # Create a new model each time
  a <- runif(1)
  b <- runif(1)
  n <- 50*runif(1) + 50
  x <- runif(n)
  y <- a*x + b + rnorm(n,0.1)

  fit <- lm(x~y)

  fitlist[[length(fitlist)+1]] <- serialize(fit,NULL)
}

vdf <- data.frame(id=1:niter)
vdf$fitlist <- fitlist

for (i in 1:niter){
  print(unserialize(vdf$fitlist[[i]]))
}

yields:

Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    0.45689      0.07766  


Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    0.44922      0.00658  


Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    0.41036      0.04522  


Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    0.40823      0.07189  


Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    0.40818      0.08141  
Mike Wise
  • This creates a data frame of 11000+ rows. What I need is one data frame row per model. Please see the added comment in the question. – kishore Nov 10 '15 at 11:15
  • Ok, fixed that. Now one model per row. – Mike Wise Nov 10 '15 at 11:47
  • Anything wrong with this? – Mike Wise Nov 10 '15 at 13:07
  • Sorry stepped out for a bit. Let me do a bit of tweaking to see if I can use your approach to get what I am looking for. I will update the comment as I find out. – kishore Nov 10 '15 at 13:29
  • I used your approach where a list of lm object is created separately and the data frame is also built simultaneously. Finally the lm object list is added to the list form of data frame and a data frame is created using as_data_frame of dplyr package. Thanks a bunch to you and @AntoniosK for the help. – kishore Nov 10 '15 at 13:52
  • I guess I am not able to accept 2 posts as answer, but your approach has provided me vital clue for the solution. – kishore Nov 10 '15 at 13:54
  • Well, you could give us a few points you know... It's not like they cost you anything. – Mike Wise Nov 10 '15 at 14:14
  • Sorry, forgot about it... it is done. Thanks – kishore Nov 10 '15 at 14:17