In R, regress all columns against a single vector and store regression coefficients and R-squared values

Question

test_data <- cbind(Fund1 = c(NA, NA, NA,1,5,6,7,8,9,10),
                   Fund2 = c(NA, 1,2,4,5,6,7,5,NA,NA),
                   Fund3 = c(NA,2,4,5,6,7,5,4,NA,NA),
                   Fund4 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
peer_average <- rowMeans(as.data.frame(test_data), na.rm = TRUE)
test_data <- cbind(test_data, data.frame(peer_average))

I want to regress Fund1, Fund2 and Fund3 against peer_average. In practice I have a much larger data frame, so I want the approach to be simple to extend.

My goal is to have output_matrix1 = 4 beta coefficients, output_matrix2 = 4 alpha coefficients and output_matrix3 = 4 R-squared values, one from each respective regression.

I know how to run individual regressions with lm(y ~ x), but I am not sure how to handle the NAs in the series. Originally I thought lapply would work, but I have not been able to figure it out. I want each regression to be calculated on the pairwise overlapping (non-NA) observations of the two series. The problem with running lm(y ~ x) in this case, unlike in previous examples such as "Fitting a linear model with multiple LHS", is Fund4, which is entirely NA.

score 0 · Answer 1 · answered Jul 04 '23 at 06:31

The main issue is that your data is untidy: each fund observation should be a row, the fund number should be a column, and peer_average should be kept separately.

I'm assuming you are comparing funds across different years, but you don't have a year column. I added one, to make it all run a bit smoother:

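library(dplyr) # select, mutate, group_by, summarise, pull
library(tidyr) # pivot_longer

# note: the output below shows only Funds 1-3, so this assumes the all-NA Fund4 column has already been dropped
test_data <- test_data %>%
    select(-peer_average) %>% # removing peer average (the lm() call below uses the standalone peer_average vector)
    mutate(year = 2001:2010) %>% # adding year column
    pivot_longer(cols = -year, names_to = "Fund", values_to = "Value") %>% # pivoting it longer
    mutate(Fund = as.numeric(gsub("Fund", "", Fund))) %>% # making fund number column a number, instead of things like "Fund1"
    group_by(Fund) # grouping by fund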

# test_data after cleaning:
    year  Fund Value
   <int> <dbl> <dbl>
 1  2001     1    NA
 2  2001     2    NA
 3  2001     3    NA
 4  2002     1    NA
 5  2002     2     1
 6  2002     3     2
 7  2003     1    NA
 8  2003     2     2
 9  2003     3     4
10  2004     1     1
# ℹ 20 more rows

test_data %>%
    summarise(model = list(lm(Value ~ peer_average))) %>%
    pull(model)
[[1]]

Call:
lm(formula = Value ~ peer_average)

Coefficients:
 (Intercept)  peer_average  
      -1.241         1.189  


[[2]]

Call:
lm(formula = Value ~ peer_average)

Coefficients:
 (Intercept)  peer_average  
     -0.5883        1.0831  


[[3]]

Call:
lm(formula = Value ~ peer_average)

Coefficients:
 (Intercept)  peer_average  
      1.8039        0.6468

answered Jul 04 '23 at 06:31

Mark


The issue is that if I do `test_data <- Filter(function(x) !all(is.na(x)), test_data); lm(as.matrix(test_data) ~ test_data$peer_average, data = test_data, na.action = na.omit)`, it will also work. I understand that the data may not be in the best format, but that is how it is given. Given the above solution, is there anything you can recommend? – Jak Carty Jul 04 '23 at 06:39
have you tried not doing that? The thing I would recommend is keeping your data tidy and with as few NAs as you can – Mark Jul 04 '23 at 06:40
the year column doesn't have to be a year - it could be the numbers 1 to 10 - it's just important once the data is in a long format to know what order things are in – Mark Jul 04 '23 at 06:44
re: it being a single vector: `dplyr` passes the values to `lm` as a vector, grouped by fund, so you don't need to worry about that. If you start combining every fund's results and removing NAs, then it stops making any sense to do linear regression on it (because the lengths of peer_average and the combined vector will be totally different) – Mark Jul 04 '23 at 06:45
Sorry @Mark, I believe the results you get are different to the results I get in the calculation above. Does the output I get make sense to you? This is what I want. e.g. Fund 1 Int = -5.1241 and Beta = 1.9489 – Jak Carty Jul 04 '23 at 07:03
running `Fund1 <- c(NA, NA, NA, 1, 5, 6, 7, 8, 9, 10); lm(Fund1 ~ peer_average, data = test_data)` doesn't give me that – Mark Jul 04 '23 at 07:06
I can achieve the desired result - using your process - by (1) removing Fund4 from the series and (2) group_by "Fund" variable. But in either case, I still need to remove cases where all data is NA – Jak Carty Jul 04 '23 at 07:07
ah okay, you changed the input data... – Mark Jul 04 '23 at 07:11
Don't stress. Sorry, you're right. Happy to leave it here! – Jak Carty Jul 04 '23 at 07:11
in the case where only some of the values in a group are NA, you really don't have to remove the NAs; `lm` handles it. If **all** of them are NA, then there's nothing to fit a regression on, so removing that group is the best bet – Mark Jul 04 '23 at 07:13
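The point made in the last comment can be checked directly. The following is only a minimal base R sketch using the data from the question; the names `fit2` and `keep` are introduced here for illustration and do not appear in the thread. It assumes the session's `na.action` option is still the default (`na.omit`).

```r
# Data from the question, built as a data frame
test_data <- data.frame(
  Fund1 = c(NA, NA, NA, 1, 5, 6, 7, 8, 9, 10),
  Fund2 = c(NA, 1, 2, 4, 5, 6, 7, 5, NA, NA),
  Fund3 = c(NA, 2, 4, 5, 6, 7, 5, 4, NA, NA),
  Fund4 = rep(NA_real_, 10)
)
test_data$peer_average <- rowMeans(test_data, na.rm = TRUE)

# Partial NAs: with the default na.action (na.omit), lm() silently
# drops every row where either variable is missing
fit2 <- lm(Fund2 ~ peer_average, data = test_data)
nobs(fit2) # 7: the rows where both Fund2 and peer_average are present

# All-NA columns: there is nothing to fit, so drop them before modelling
keep <- Filter(function(x) !all(is.na(x)), test_data)
names(keep) # Fund4 is gone; Fund1, Fund2, Fund3 and peer_average remain
```

Each fund is fitted on its own overlapping observations, so the number of rows used can differ from fund to fund.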

score 0 · Accepted Answer · answered Jul 04 '23 at 11:10

Here is how I would do it. Similar to the other answer, I would go from wide to long format (one function), then I would nest, map out the regressions, and pull out the coefficients:

library(tidyverse)

pivot_longer(test_data, 
             -peer_average, 
             names_to = "Fund", 
             names_pattern = "Fund(\\d+)", 
             values_drop_na = TRUE) |>
  nest(model = -Fund) |>
  mutate(model = map(model, ~summary(lm(peer_average~value, data = .x))),
         R2 = map_dbl(model, ~.x$r.squared),
         model = map(model, ~broom::tidy(.x)$estimate),
         model = map(model, ~set_names(.x, c("Intercept", "Coef"))))  |>
  unnest_wider(model)
#> # A tibble: 3 x 4
#>   Fund  Intercept  Coef    R2
#>   <chr>     <dbl> <dbl> <dbl>
#> 1 2         0.880 0.845 0.915
#> 2 3         0.273 0.897 0.580
#> 3 1         2.12  0.677 0.805
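For readers who want the three separate outputs the question asks for, here is a hedged base R sketch. It is not the code of either answer: the names `keep`, `funds`, `fits`, `alpha`, `beta`, `r_squared` and `results` are introduced for illustration, and it regresses each fund on `peer_average` (the direction stated in the question), whereas the pipeline above fits `peer_average ~ value`.

```r
# Data from the question, built as a data frame
test_data <- data.frame(
  Fund1 = c(NA, NA, NA, 1, 5, 6, 7, 8, 9, 10),
  Fund2 = c(NA, 1, 2, 4, 5, 6, 7, 5, NA, NA),
  Fund3 = c(NA, 2, 4, 5, 6, 7, 5, 4, NA, NA),
  Fund4 = rep(NA_real_, 10)
)
test_data$peer_average <- rowMeans(test_data, na.rm = TRUE)

# Drop columns that are entirely NA (here: Fund4)
keep <- Filter(function(x) !all(is.na(x)), test_data)
funds <- setdiff(names(keep), "peer_average")

# One simple regression per fund; lm() drops incomplete rows itself
fits <- lapply(funds, function(f) {
  lm(reformulate("peer_average", response = f), data = keep)
})
names(fits) <- funds

# Collect intercepts, slopes and R-squared values as named vectors
alpha <- sapply(fits, function(m) coef(m)[["(Intercept)"]])
beta <- sapply(fits, function(m) coef(m)[["peer_average"]])
r_squared <- sapply(fits, function(m) summary(m)$r.squared)

# Optional: one matrix with a row per fund
results <- cbind(alpha, beta, r_squared)
```

For Fund1 this gives an intercept of about -1.241 and a slope of about 1.189, matching the first answer's output. In a simple regression, R-squared does not depend on which variable is the response, so the `r_squared` values should agree with the `R2` column above even though the coefficients differ.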
