1

I am trying to write a function that will run multiple regressions and then store the outputs in a vector. What I want is for the function to pick the dependent variables from a list that I will provide, and then run the regressions on the same right hand-side variables. Not sure how to go about doing this. Any hints will be appreciated.

my_data <- data.frame(x1=(1:10) + rnorm(10, 3, 1.5), x2=25/3 + rnorm(10, 0, 1), 
                      dep.var1=seq(5, 28, 2.5), dep.var2=seq(100, -20, -12.5), 
                      dep.var3=seq(1, 25, 2.5))

## The following is a list that tells the function
dep.var <- list(dep.var1=my_data$dep.var1, dep.var2=my_data$dep.var2)
## which dependent variables to use from my_data

all_models <- function(dep.var){lm(dep.var ~ x1 + x2, data=my_data)}

model <- sapply(dep.var, all_models) ## The "sapply" here tells the function to 
## take the dependent variables from the list dep.var.

I want the "model" list to have two objects: model1 with dep.var1 and model2 with dep.var2. Then as required, I will use summary(model#) to see the regression output.

I know that this in theory works when a vector is used (i.e., p):

p <- seq(0.25, 0.95, 0.05)

s <- function(p) {1 - pnorm(35, p*1*44, sqrt(44)*sqrt(p*(1 - p)))}

f <- sapply(p, s)

But I can't get the whole thing to work as required for my regression models. It works somewhat because you can run and check "model" and it will show you the two regression outputs - but it is horrible. And the "model" does not show the regression specification, i.e., dep.var1 ~ x1 + x2.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
Pineapple
  • 193
  • 8

3 Answers3

3

Consider reformulate to dynamically change model formulas using character values for lm calls:

# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")
 
# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
    fml <- reformulate(c("x1", "x2"), dep.var)
    lm(fml, data=data)
}
                     
# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)

# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...

From there, you can run further extractions or processes across model objects:

# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)

all_summaries$dep.var1
all_summaries$dep.var2
...

# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")

all_coefficients$dep.var1
all_coefficients$dep.var2
...
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Hi this works great. Can you explain a few things please. What really is reformulate doing here? I know it arranges the vectors in a formula but why couldn't it be done without reformulate? And you wrote "dep.var" in the function and in reformulate, but depvar**s** elsewhere - so it shouldn't work but it does. Shouldn't they all be the same? – Pineapple Sep 18 '21 at 16:51
  • Please see the link to the method, `reformulate`. When you call `lm`, you are passing a formula object (not string) as first argument. This method dynamically alters formula taking in character strings which is equivalent to `paste` and `as.formula` calls. for second question, `dep.var` in function is the parameter being passed from caller. `sapply` is the caller that iterates through `dep.vars`, the vector of many values. So each value of this vector is passed into function. – Parfait Sep 19 '21 at 03:27
1

You could sapply over the names of the dependent vars which you could nicely identify with grep. In lm use reformulate to build the formula.

sapply(grep('^dep', names(my_data), value=TRUE), \(x) 
       lm(reformulate(c('x1', 'x2'), x), my_data))
#               dep.var1           dep.var2           dep.var3          
# coefficients  numeric,3          numeric,3          numeric,3         
# residuals     numeric,10         numeric,10         numeric,10        
# effects       numeric,10         numeric,10         numeric,10        
# rank          3                  3                  3                 
# fitted.values numeric,10         numeric,10         numeric,10        
# assign        integer,3          integer,3          integer,3         
# qr            qr,5               qr,5               qr,5              
# df.residual   7                  7                  7                 
# xlevels       list,0             list,0             list,0            
# call          expression         expression         expression        
# terms         dep.var1 ~ x1 + x2 dep.var2 ~ x1 + x2 dep.var3 ~ x1 + x2
# model         data.frame,3       data.frame,3       data.frame,3   

The dep.var* appear nicely in the result.

However, you probably want to use lapply and pipe it into setNames() to get the list elements named. Instead of grep you may of course define the dependent variables manually. To get a clean formular call stored, we use a trick once @g-grothendieck taught me using do.call.

dv <- as.list(grep('^dep', names(my_data), value=TRUE)[1:2])
res <- lapply(dv, \(x) {
  f <- reformulate(c('x1', 'x2'), x)
  do.call('lm', list(f, quote(my_data)))
}) |>
  setNames(dv)
res
# $dep.var1
# 
# Call:
#   lm(formula = dep.var1 ~ x1 + x2, data = my_data)
# 
# Coefficients:
#   (Intercept)           x1           x2  
# -4.7450       2.3398       0.2747  
# 
# 
# $dep.var2
# 
# Call:
#   lm(formula = dep.var2 ~ x1 + x2, data = my_data)
# 
# Coefficients:
#   (Intercept)           x1           x2  
# 148.725      -11.699       -1.373 

This allows you to get the summary() of the objects, which probably is what you want.

   summary(res$dep.var1)
# Call:
#   lm(formula = dep.var1 ~ x1 + x2, data = my_data)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -2.8830 -1.8345 -0.2326  1.4335  4.2452 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -4.7450     7.2884  -0.651    0.536    
# x1            2.3398     0.2836   8.251 7.48e-05 ***
#   x2            0.2747     0.7526   0.365    0.726    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.55 on 7 degrees of freedom
# Multiple R-squared:  0.9117,  Adjusted R-squared:  0.8865 
# F-statistic: 36.14 on 2 and 7 DF,  p-value: 0.0002046

Finally you could wrap it in a function

calc_models <- \(dv) {
  lapply(dv, \(x) {
    f <- reformulate(c('x1', 'x2'), x)
    do.call('lm', list(f, quote(my_data)))
  }) |>
    setNames(dv)
}

calc_models(list('dep.var1', 'dep.var2'))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • @Parfait See `lapply` part of my answer. What do you reckon could be introduced in terms of `reformulate` in a better way? – jay.sf Sep 18 '21 at 16:15
  • Thanks. But this isn't working on my end : ``` sapply(grep('^dep', names(my_data), value=TRUE), \(x) lm(reformulate(c('x1', 'x2'), x), my_data))``` It gives an error. Also, the second part starting with "dv" isn't working. – Pineapple Sep 18 '21 at 16:58
  • @Pineapple Just put the names of your dependent variables into a list named `dv` should work. – jay.sf Sep 18 '21 at 17:01
0

Here is a way how you could iterate through your dataframe and apply the function to the group you define (here dep.var) and save the different models in a dataframe:

library(tidyverse)
library(broom)
my_data %>% 
    pivot_longer(
        starts_with("dep"),
        names_to = "group",
        values_to = "dep.var"
    ) %>% 
    mutate(group = as.factor(group)) %>% 
    group_by(group) %>% 
    group_split() %>% 
    map_dfr(.f = function(df) {
        lm(dep.var ~ x1 + x2, data = df) %>%
             tidy() %>% # first output 
            #glance() %>% # second output
            add_column(group = unique(df$group), .before=1)
    })

dataframe output:

# A tibble: 9 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var1 (Intercept)   -5.29     11.6      -0.456 0.662   
2 dep.var1 x1             2.11      0.268     7.87  0.000101
3 dep.var1 x2             0.538     1.23      0.437 0.675   
4 dep.var2 (Intercept)  151.       57.9       2.61  0.0347  
5 dep.var2 x1           -10.6       1.34     -7.87  0.000101
6 dep.var2 x2            -2.69      6.15     -0.437 0.675   
7 dep.var3 (Intercept)   -9.29     11.6      -0.802 0.449   
8 dep.var3 x1             2.11      0.268     7.87  0.000101
9 dep.var3 x2             0.538     1.23      0.437 0.675 

list output:

[[1]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var1 (Intercept)   -5.29     11.6      -0.456 0.662   
2 dep.var1 x1             2.11      0.268     7.87  0.000101
3 dep.var1 x2             0.538     1.23      0.437 0.675   

[[2]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var2 (Intercept)   151.       57.9      2.61  0.0347  
2 dep.var2 x1            -10.6       1.34    -7.87  0.000101
3 dep.var2 x2             -2.69      6.15    -0.437 0.675   

[[3]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var3 (Intercept)   -9.29     11.6      -0.802 0.449   
2 dep.var3 x1             2.11      0.268     7.87  0.000101
3 dep.var3 x2             0.538     1.23      0.437 0.675 

glance output:

  group    r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>        <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 dep.var1     0.927         0.906  2.32      44.3 0.000106     2  -20.8  49.7  50.9     37.8           7    10
2 dep.var2     0.927         0.906 11.6       44.3 0.000106     2  -36.9  81.9  83.1    944.            7    10
3 dep.var3     0.927         0.906  2.32      44.3 0.000106     2  -20.8  49.7  50.9     37.8           7    10
TarJae
  • 72,363
  • 6
  • 19
  • 66