7

This is a little rudimentary, I know. Basically, I want to use the save data from the coef function to a shared data frame for models that all pull limited possible variables from a larger shared data set.

I have 3 sets of 14 models. Each set uses 15-25 variables from a 100 variable data-set, and each of the models uses a mix of about 12 variables, which change from model to model. What I would like to do is save the coefficients for each of the 14 models into one data-frame.

Coefs=data.frame(col.names = names(EST))

The coefficients look something like this:

Coefficients:
                    Estimate   Std. Error  t value           Pr(>|t|)    
RT_SCORE_USER       0.2427506  0.0310486   7.818 0.0000000000000836 ***
VOD.Window..weeks.  0.0092641  0.0009985   9.278            < 2e-16 ***
PX_WK3              0.0300395  0.0098943   3.036           0.002600 ** 

For a good 10-15 variables. For instance, PX has 14 weeks (WK1, 2, etc). I want to save the Estimate values into this grid, where for each row, there are 100 columns listing all possible variables. The majority of which will be 0. This table will be imported into excel where I can simply cross multiply for each week's model.

My struggle is figuring out how to record all the varying coefficients from the various weeks into ONE data.frame, where each model has a separate row:

       PX_WK1     PXWK_2   RT_SCORE_USER  IMAVARIABLE etc.
ESTWK1   .030     0         .24            0
ESTWK2   0        .023      .44            etc
ESTWK3   0        0         etc etc etc

I understand how to use coef(ESTWK1), but when I try to paste that into a row, I naturally get an error confusing the lengths of the two vectors, say 15 in this model out of a potential 100.

I want to automate this process so when new data is processed and the regressions are run, I can run my code saving the new coefficients' data, and then I can output that to a CSV (that part I've got). Thoughts?

Jason O
  • 75
  • 2
  • 7

3 Answers3

9

The first step is to combine your coefficients into a data frame with one row per combination of model and term. Then you'll be able to spread it into a table with one row per model and one column per term.

My broom package has a useful function, tidy for turning a linear fit into a data frame of coefficients:

fit <- lm(mpg ~ wt + disp + qsec, mtcars)
library(broom)
tidy(fit)
#          term  estimate std.error statistic p.value
# 1 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 2          wt -5.034410    1.2241   -4.1127 0.00031
# 3        disp -0.000128    0.0106   -0.0121 0.99042
# 4        qsec  0.926649    0.3421    2.7087 0.01139

(Note that unlike coef, this returns a data frame rather than a matrix, and incorporates the terms as a column rather than rownames). You can apply this function to each of your models and then recombine, for example with plyr's ldply. We generate an example using 20 of the same model as your "models":

models <- replicate(20, lm(mpg ~ wt + disp + qsec, mtcars), simplify = FALSE)
names(models) <- paste0("MODEL", 1:20)

Then our "tidy and recombine" code will be:

all_coefs <- plyr::ldply(models, tidy, .id = "model")
head(all_coefs)
#    model        term  estimate std.error statistic p.value
# 1 MODEL1 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 2 MODEL1          wt -5.034410    1.2241   -4.1127 0.00031
# 3 MODEL1        disp -0.000128    0.0106   -0.0121 0.99042
# 4 MODEL1        qsec  0.926649    0.3421    2.7087 0.01139
# 5 MODEL2 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 6 MODEL2          wt -5.034410    1.2241   -4.1127 0.00031

You then need to remove the std.error, statistic, and p.value columns and spread the estimate term out. This can be done with the dplyr and tidyr packages:

library(dplyr)
library(tidyr)
results <- all_coefs %>% select(-(std.error:p.value)) %>%
    spread(term, estimate)

This produces:

     model (Intercept)      disp  qsec    wt
1   MODEL1        19.8 -0.000128 0.927 -5.03
2   MODEL2        19.8 -0.000128 0.927 -5.03
3   MODEL3        19.8 -0.000128 0.927 -5.03
4   MODEL4        19.8 -0.000128 0.927 -5.03
5   MODEL5        19.8 -0.000128 0.927 -5.03

Which is your desired output. (This output is boring since all the models were the same, but presumably yours are different). If some models have coefficients others don't, the missing values will be filled in with NA.

David Robinson
  • 77,383
  • 16
  • 167
  • 187
  • Quick clarification question -- on the "tidy and recombine" section, how can I make the "models" variable include all 14 of my models (names are ESTWK1, ESTWK2, ...ESTWK14). These models have already been created and their coefficients are set. Before I run ldply, how do I turn all 14 of my models into a singular variable? Or do I just run each model one at a time, in a loop or whatever? – Jason O Apr 16 '15 at 21:39
  • @JasonO put them in a named list. How you get them into a named list depends on what structure they're currently in, which you haven't specified. For example, if they're 14 separate variables, you could do `list(ESTWK1 = ..., ESTWK2 = ...,)` and so on. – David Robinson Apr 16 '15 at 23:51
  • Pretty nice to get advice from the guy who made the package! This is the line tripping me up: `all_coefs <- plyr::ldply(models, tidy, .id = "model")` In any case, for the creation of the list, let's assume that ESTWK1 through ESTWK14 (and all the models in between) are all regression models. They vary in variable count from 25 to 8. In order to create the named list (the `models` portion of your code), I would write: `list(ESTWK1=?` .... what goes in place of the ... for each model in the list? – Jason O Apr 17 '15 at 00:18
  • You've already fit the models? And they're saved as separate variables called `ESTWK1`, `ESTWK2`, etc? If so, you could do `models <- list(ESTWK1=ESTWK1, ESTWK2=ESTWK2, ... ESTWK14=ESTWK14)`. To save yourself the typing, you could do the same thing with `mget(paste0("ESTWK", 1:14))`. (`mget` retrieves a list of objects by name) – David Robinson Apr 17 '15 at 00:20
  • So I have 14 models that have been run: ie ESTWK1=lm(VarName ~ alpha+beta+gamma, data=DATAZ). I need the coefficients from each one into the same data.frame. Taking your advice, I ran the code ESTMods<-list(mget(paste0("ESTWK", 1:14)))` ... So far so good, I have an object called `ESTMods` listed as a Large List of size 2.8Mb. But then `all_coefs <- plyr::ldply(ESTMods, tidy, .id = "model")` got a weird error... – Jason O Apr 17 '15 at 00:30
  • `Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ""lm"" to a data.frame In addition: Warning message: In tidy.default(X[[1L]], ...) : No method for tidying an S3 object of class list , using as.data.frame` – Jason O Apr 17 '15 at 00:30
  • Please run `sapply(models, class)` and let me know what it says. – David Robinson Apr 17 '15 at 00:34
  • `sapply(ESTMods, class) [1] "list"` – Jason O Apr 17 '15 at 00:38
  • @JasonO ah, I see your issue. Change the line from `ESTMods<-list(mget(paste0("ESTWK", 1:14)))` to just `ESTMods<-mget(paste0("ESTWK", 1:14))`. The extra `list(` means it is a list of length 1, with its first item being the actual list of models. – David Robinson Apr 17 '15 at 00:39
  • Thank you! I literally just jumped out of my chair and scared my coworker. No coffee spilled. Have a great weekend! +1 – Jason O Apr 17 '15 at 00:41
2

I'd approach this by doing something like this:

x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
m1 <- lm(y ~ x1 + x2)
m2 <- lm(y ~ x1 + x3) 
m3 <- lm(y ~ x2 + x3)

variables <- data.frame(variable = c("(Intercept)", "x1", "x2", "x3"),
                        model = rep(c("m1", "m2", "m3"), each = 4))
data <- data.frame(variable = c(names(coef(m1)), names(coef(m2)), 
                                names(coef(m3))),
                   estimate = c(coef(m1), coef(m2), coef(m3)), 
                   model = c(rep("m1", length(coef(m1))), 
                             rep("m2", length(coef(m2))),
                             rep("m3", length(coef(m3)))))
data2 <- left_join(variables, data)
data2$estimate[is.na(data2$estimate)] <- 0
data2
reshape(data2, timevar = "variable", v.names = "estimate", 
        idvar = "model", direction = "wide")

Basically, fit the models and then extract the estimates and the row names. Then make a data frame variables that includes all possible variable names for each model. Use left_join from dplyr to do the join and then reshape it into the format you want.

iacobus
  • 587
  • 3
  • 10
1

The tidyverse has evolved and now it's even simpler to train multiple models and to extract statistics from the fitted models.

Simulate data:

library("modelr")
library("tidyverse")

n <- 100
data <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  y = x1 + x2 + x3 + rnorm(n, sd = 0.3)
)

List of model formulas to train on the data:

fmlas <- formulas(
  ~y,
  m1 = ~x1,
  m2 = ~x1 + x2,
  m3 = ~x1 + x2 + x3
)
fmlas
#> $m1
#> y ~ x1
#> 
#> $m2
#> y ~ x1 + x2
#> 
#> $m3
#> y ~ x1 + x2 + x3

List of trained models:

fits <- data %>%
  fit_with(
    lm, fmlas
  )
fits
#> $m1
#> 
#> Call:
#> .f(formula = y ~ x1, data = data)
#> 
#> Coefficients:
#> (Intercept)           x1  
#>     -0.1542       0.6599  
#> 
#> 
#> $m2
#> 
#> Call:
#> .f(formula = y ~ x1 + x2, data = data)
#> 
#> Coefficients:
#> (Intercept)           x1           x2  
#>     0.06516      0.89962      1.06941  
#> 
#> 
#> $m3
#> 
#> Call:
#> .f(formula = y ~ x1 + x2 + x3, data = data)
#> 
#> Coefficients:
#> (Intercept)           x1           x2           x3  
#>    -0.02503      1.02045      1.02567      1.01573

Extract coefficient estimates:

fits %>%
  map_dfr(
    ~coefficients(.),
    .id = "model"
  )
#> # A tibble: 3 × 5
#>   model `(Intercept)`    x1    x2    x3
#>   <chr>         <dbl> <dbl> <dbl> <dbl>
#> 1 m1          -0.154  0.660 NA    NA   
#> 2 m2           0.0652 0.900  1.07 NA   
#> 3 m3          -0.0250 1.02   1.03  1.02

Created on 2022-04-20 by the reprex package (v2.0.1)

dipetkov
  • 3,380
  • 1
  • 11
  • 19