8

I want to figure out how to write a loop, or use one of the apply functions, to get individual 1:1 regression information for each variable in a dataset against the dependent variable.

Let's say I am using mtcars. How would I write R code that takes each variable in the data frame and regresses mpg on it?

Even better would be getting a summary for each independent variable, with some sort of name assignment such as x1=, x2=, etc.

summary(lm(mpg~eachvar,data=mtcars))
runningbirds
  • A non-standard approach for this problem: [Fast pairwise simple linear regression between variables in a data frame](https://stackoverflow.com/q/51953709/4891738). The `general_paired_simpleLM` could be useful when all your variables are numeric. – Zheyuan Li Aug 27 '18 at 01:50

3 Answers

15

This will do it for you.

lapply( mtcars[,-1], function(x) summary(lm(mtcars$mpg ~ x)) )

A data.frame is a list of columns with some extra attributes, so this goes through each column of mtcars except the first (mpg) and fits the regression. If you save the resulting list in something like L, you can access each summary by the column's name, or by its position after dropping mpg (one less than its column number in mtcars). So L$cyl (or L[[1]]) gives the regression summary for mpg on cyl.
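As a small sketch of how the saved list might be used (the names `L` and `r2` are only illustrative), the following stores the summaries and pulls one number out of each. Note that because the model is written as `mtcars$mpg ~ x`, the slope row in every coefficient table is labelled `x` rather than with the column name.

```r
# Fit mpg against every other column and keep the summaries in a named list.
L <- lapply(mtcars[, -1], function(x) summary(lm(mtcars$mpg ~ x)))

names(L)   # the predictor column names: "cyl", "disp", ...
L$cyl      # summary of the mpg-on-cyl regression
L[[2]]     # same as L$disp: positions count from the first predictor

# Pull one number per model into a named numeric vector.
r2 <- sapply(L, function(s) s$r.squared)
round(sort(r2, decreasing = TRUE), 3)
```

`sapply` is used here instead of `lapply` only so the R-squared values come back as a plain named vector that can be sorted.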

John
  • Actually this one makes more sense. And could also easily do stuff like `lapply(L, function(x) x$r.squared) ; lapply(L, coef)` – David Arenburg Jul 30 '14 at 12:33
7

A data.table version of John's solution

library(data.table)
Fits <- 
    data.table(mtcars)[, 
              .(MyFits = lapply(.SD, function(x) summary(lm(mpg ~ x)))), 
              .SDcols = -1]

Some explanations of the code

  • data.table will convert mtcars to a data.table object
  • .SD is also a data.table object which contains the columns one wants to operate on
  • .SDcols = -1 tells .SD not to use the first column (as we don't want to fit lm(mpg ~ mpg))
  • lapply just runs the model over all the columns in .SD (except the one we skipped) and returns objects of class list

Fits will be a data.table whose MyFits column is a list of summaries; you can inspect them using

Fits$MyFits

But you can also operate on them, for example, applying coef function on each fit

Fits[, lapply(MyFits, coef)]

Or getting the r.squared

Fits[, lapply(MyFits, `[[`, "r.squared")]
David Arenburg
  • Thanks for this! When I use this solution I get the following error: `Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases` Any ideas what leads to this error? I want to use this on a rather "dirty" dataset. Could it be that some exceptions are needed? Is it for example possible to add a `try` statement to this solution to prevent it from blowing up? – Tom Aug 08 '18 at 12:49
  • It probably means that all of your values are NAs. You need to clean your data or use `tryCatch`. Either way, this answer is old and needs some updating. – David Arenburg Aug 08 '18 at 15:52
  • Thank you for your answer. I thought that at first, but I removed all variables where all values (more than 99%) were NA. For my particular (huge) dataset perhaps it is more likely that there are some non-numerical variables in there? But I guess then `tryCatch` would still be the solution. I have not used `data.table` a lot yet. Would it be possible to show me where to incorporate the `tryCatch`? – Tom Aug 09 '18 at 07:18
  • You could simply check that the variable is numeric first, e.g. `data.table(mtcars)[, .(MyFits = lapply(.SD, function(x) if(is.numeric(x)) summary(lm(mpg ~ x)))), .SDcols = -1]` – David Arenburg Aug 09 '18 at 13:05
  • Thank you, I still have some trouble seeing how to apply statements like that. By the way, when I applied your solution to `mtcars`, I got rows which start like `list(call = lm(formula = mpg ~ x), terms = mpg ~ x, residu..` Was this the intended outcome or is something going wrong there? – Tom Aug 09 '18 at 13:39
  • Not sure what you mean. It works fine for me on mtcars. – David Arenburg Aug 09 '18 at 16:43
  • It does, apparently I am just a bit of an idiot and only read half your answer. My apologies and thank you for your help and patience! – Tom Aug 09 '18 at 16:53
  • @David Arenburg, thank you for this, it helps me a lot. How would I list all the models by AIC, so that we can see them from best to worst? Thanks. – R starter May 05 '19 at 20:07
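The two questions raised in the comments above (guarding against columns that make `lm` fail, and ordering the models by AIC) could be handled along these lines in base R. This is only a sketch: `safe_fit` and the `dirty` data frame are hypothetical names made up for the illustration, and the "dirty" columns are added by hand to reproduce the `0 (non-NA) cases` error.

```r
# Hypothetical helper: fit y ~ x for one column, returning NULL on failure.
safe_fit <- function(x, y) {
  if (!is.numeric(x)) return(NULL)               # skip non-numeric columns
  tryCatch(lm(y ~ x), error = function(e) NULL)  # e.g. "0 (non-NA) cases"
}

# A "dirty" copy of mtcars: one all-NA column and one character column.
dirty <- mtcars
dirty$all_na <- NA_real_
dirty$label  <- rownames(mtcars)

# Fit every column except mpg, then drop the ones that could not be fitted.
fits <- lapply(dirty[, names(dirty) != "mpg"], safe_fit, y = dirty$mpg)
fits <- Filter(Negate(is.null), fits)

# AIC per model, sorted so the lowest (best) comes first.
aic <- sort(sapply(fits, AIC))
aic
```

The function keeps the `lm` objects rather than their summaries because `AIC()` works on the fitted model; `lapply(fits, summary)` recovers the summaries afterwards. Comparing AIC values this way is only meaningful when every model is fitted to the same rows, which holds here because the surviving columns have no missing values. The same wrapper could presumably be passed as the function inside `lapply(.SD, ...)` in the data.table version, though that is untested here.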
3

Hi, try something like this:

models <- lapply(paste("mpg", names(mtcars)[-1], sep = "~"), formula)
res.models <- lapply(models, FUN = function(x) {summary(lm(formula = x, data = mtcars))})
names(res.models) <- paste("mpg", names(mtcars)[-1], sep = "~")
res.models[["mpg~disp"]]


# Call:
# lm(formula = x, data = mtcars)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.8922 -2.2022 -0.9631  1.6272  7.2305 

# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
# disp        -0.041215   0.004712  -8.747 9.38e-10 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 3.251 on 30 degrees of freedom
# Multiple R-squared:  0.7183,  Adjusted R-squared:  0.709 
# F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
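A possible variation on the formula-building approach, sketched here with illustrative names (`predictors`, `fits`, `tab`): `reformulate()` builds each formula without pasting strings, and the key numbers from every model are collected into one data frame so they can be compared side by side.

```r
predictors <- setdiff(names(mtcars), "mpg")

# reformulate() builds mpg ~ <predictor> for each column name.
fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(fits) <- predictors

# One row per model: slope, its p-value, and R-squared.
tab <- do.call(rbind, lapply(predictors, function(v) {
  s <- summary(fits[[v]])
  data.frame(
    predictor = v,
    slope     = s$coefficients[v, "Estimate"],
    p_value   = s$coefficients[v, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}))

# Strongest single predictors first.
tab[order(-tab$r_squared), ]
```

Because the formula names the real column, the slope row in each coefficient table is labelled with the variable name (for example `disp`), which is what makes the lookup by `v` work. This relies on every predictor being numeric, as is the case in mtcars.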
Victorp