Multiple linear regression for multiple columns (one at a time) in a dataframe in R

Question

I have a dataset with columns representing wavenumbers along with weight. A sample dataset is given below and the actual dataset has almost 3000+ columns related to different wavenumbers. I want to perform a multiple linear regression for each wavenumber (weight as the independent variable and each wavenumber at a time as the dependent variable) and obtain the R^2 values for each wavenumber. The objective is to identify the most important wavenumbers based on these R^2 values

For example, the linear regression model for 1st wavenumber will be:

X400 ~ weight X401 ~ weight etc...

I want to create a loop to obtain the R^2 values for each wavenumber and store it in a separate table to use in further analysis.

I used the below code to build a linear model for 1 wavenumber. But I don't understand to create a loop for each column as the original dataset has 3000+ wavenumbers.

  df1 = data.frame(
  weight = c(15, 18, 20, 21, 18),
  X400 = c(12.5, 9, 16.5, 9, 20),
  X401 = c(12, 19, 14.5, 9, 26),
  X402 = c(11.5, 9.6, 18.5, 19, 20),
  X403 = c(10.5, 8, 12.5, 17, 23),
  X404 = c(12.5, 9, 15, 9, 20),
  X405 = c(14.5, 19, 12.5, 8, 21.2),
  X406 = c(13.5, 7, 18.5, 12, 17),
  X407 = c(12, 3.9, 12.9, 10, 4.8))

  model <- lm(X400 ~ weight, data = df1)
  summary(model)$r.squared

The output table should look like this. R1, R2, R3,... represents the R^2 values related to each wavenumber.

Wavenumber	R^2 value
X400	R1
X401	R2
X402	R3
X403	R4
X404	R5
X405	R6
X406	R7
X407	R8

How can I do this in R?

Something like this will get the values: `sapply(names(df1)[-1], function(var) summary(lm(paste(var,"~weight"), df1))$r.squared)` — MrFlick, Jan 20 '23 at 14:28
The response vectors can be arranged as a matrix on the LHS: `models <- lm(as.matrix(df1[-1]) ~ weight, df1); models |> summary() |> sapply(function(x) x$r.squared) |> stack() |> setNames(c("R2", "Response")) |> rev() |> transform(Response = sub(".* ", "", Response))` Everything from stack onwards is just to make it a data frame with nice names and can be omitted if that is not needed. — G. Grothendieck, Jan 20 '23 at 14:30

score 0 · Answer 1 · answered Jan 20 '23 at 14:30

How about this:

library(dplyr)
library(tidyr)
df1 = data.frame(
  weight = c(15, 18, 20, 21, 18),
  X400 = c(12.5, 9, 16.5, 9, 20),
  X401 = c(12, 19, 14.5, 9, 26),
  X402 = c(11.5, 9.6, 18.5, 19, 20),
  X403 = c(10.5, 8, 12.5, 17, 23),
  X404 = c(12.5, 9, 15, 9, 20),
  X405 = c(14.5, 19, 12.5, 8, 21.2),
  X406 = c(13.5, 7, 18.5, 12, 17),
  X407 = c(12, 3.9, 12.9, 10, 4.8))

df1 %>% 
  pivot_longer(-weight, names_to="vbl", values_to = "vals") %>% 
  group_by(vbl) %>% 
  summarise(r2 = summary(lm(vals ~ weight))$r.squared)
#> # A tibble: 8 × 2
#>   vbl        r2
#>   <chr>   <dbl>
#> 1 X400  0.00941
#> 2 X401  0.0398 
#> 3 X402  0.403  
#> 4 X403  0.0871 
#> 5 X404  0.0249 
#> 6 X405  0.259  
#> 7 X406  0.0162 
#> 8 X407  0.00384

^{Created on 2023-01-20 by the reprex package (v2.0.1)}

Multiple linear regression for multiple columns (one at a time) in a dataframe in R

1 Answers1