I would like to loop through the independent variables and regress the dependent variable on each of them individually with data.table. Because my dataset is huge, I need an efficient solution. I have found this suggestion, which uses the mtcars data frame as an example:

library(data.table)
Fits <- as.data.table(mtcars)[, list(MyFits = lapply(.SD[, -1, with = FALSE], function(x) summary(lm(mpg ~ x))))]

I first tried it on a few of my own datasets without much success. I then applied it to mtcars itself, which gave the following unexpected result: 10 rows in the column MyFits, each looking like the example below.

list(call = lm(formula = mpg ~ x), terms = mpg ~ x, residuals = c(0.370164348925326, 0.370164348925409, -3.58141592920354, 0.770164348925413, 3.82174462705436, -2.52983565107458, -0.578255372945635, -1.98141592920354, -3.58141592920354, -1.42983565107459, -2.82983565107459, 1.52174462705436, 2.42174462705436, 0.321744627054363, -4.47825537294564, -4.47825537294564, -0.178255372945637, 6.01858407079646, 4.01858407079646, 7.51858407079646, -4.88141592920354, 0.621744627054364, 0.321744627054363, -1.57825537294564, 4.32174462705436, 0.918584070796464, -0.381415929203536, 4.01858407079646, 0.921744627054365, -0.929835651074587, 0.121744627054364, -4.98141592920354), coefficients = c(37.8845764854614, -2.87579013906447, 2.07384360552423, 0.322408882659104, 18.2678078445963, -8.91969884745751, 8.36915530493018e-18, 6.11268714258098e-10), aliased = c(FALSE, FALSE), sigma = 3.20590203190608, df = c(2, 30, 2), r.squared = 0.726180005093805, adj.r.squared = 0.717052671930265, fstatistic = c(79.5610275293349, 1, 30 ), cov.unscaled = c(0.418457648546144, -0.0625790139064475, -0.0625790139064475, 0.0101137800252844))
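
The dump above appears to be a single `summary.lm` object stored in a list column and printed in deparsed form, which would suggest the fits themselves ran. A minimal sketch, assuming the same `Fits` construction as in the snippet above, of how one element could be inspected and how the slope rows could be collected (the names `first_fit` and `slopes` are illustrative only):

```r
library(data.table)

# Rebuild the list column of model summaries from the question.
Fits <- as.data.table(mtcars)[, list(
  MyFits = lapply(.SD[, -1, with = FALSE], function(x) summary(lm(mpg ~ x)))
)]

# Each element of the list column is a complete model summary.
first_fit <- Fits$MyFits[[1]]   # mpg regressed on the first predictor (cyl)
class(first_fit)                # "summary.lm"
coef(first_fit)                 # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# Collect only the slope row of every fit: one row per predictor.
slopes <- t(sapply(Fits$MyFits, function(s) coef(s)["x", ]))
```

Printing the `data.table` directly shows the list elements in abbreviated or deparsed form, so indexing with `[[` is usually the easier way to look at a single fit.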

The author of the answer to "Linear Regression loop for each independent variable individually against dependent" already mentioned that the answer was in need of an update, but I cannot figure out what is going wrong.

Any suggestions?

Tom
  • What is the expected output, i.e., which part of the output provided by summary do you actually need? – Roland Aug 09 '18 at 11:53
  • Personally, I might do this: `library(reshape2); DF <- melt(mtcars, id.vars = "mpg"); library(nlme); fits <- lmList(mpg ~ value | variable, data = DF); summary(fits)`, although I would be wary of your reason for doing this exercise. – Roland Aug 09 '18 at 11:57
  • I am not sure if I follow your first question; but at least the coefficient, sd, t-value and p-value. You mean my scientific reason for doing this exercise? – Tom Aug 09 '18 at 12:03
  • Yes, this looks like a dangerous statistical fishing expedition (which, e.g., has implications for the p-values or any inference). – Roland Aug 09 '18 at 12:04
  • Haha, yes, I am aware. And it's good that you point that out, although it will more likely lead me to abandon the dataset because of all the inconsistencies I am expecting to find. At the moment it is more of a curiosity/time-saving exercise, because my dataset is huge and manually selecting variables is making me go insane. I will be careful and prudent when interpreting the results. – Tom Aug 09 '18 at 12:12
  • Although it works for `mtcars`, I am getting an `Error in na.fail.default(data) : missing values in object` for my own data, I think because my dependent variable has NAs. I added `DF <- DF[!is.na(DF)]` or `DF <- DF[!is.na(DF$mp)]` after melting to remove the NAs. I then got the warning `attributes are not identical across measure variables; they will be dropped` and the error `Error in eval(x[[length(x)]], object) : object 'variable' not found`. Could this have to do with NAs in the independent variables? I wanted to try to fix this last issue, but I don't think I truly understand how. – Tom Aug 09 '18 at 12:41
  • Another factor could be that my variables are not always numerical but often factors. Could it be that I need to incorporate a `tryCatch` exception to your solution? How could I incorporate this into your suggestion? – Tom Aug 09 '18 at 12:44
  • `na.action = na.omit` or you could look into multiple imputation. I don't know why you need to select variables but can't you just use the LASSO or elastic net? – Roland Aug 09 '18 at 13:14
  • I think the fact that I had to look up what those terms mean should answer your question haha. But after reading up a bit, that would actually be exactly what I need. I will read up on it further. Thank you so much! – Tom Aug 09 '18 at 13:30
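
The points raised in the comments (a table of coefficients, standard errors, t-values and p-values; missing values; factor predictors; a possible `tryCatch`) could be combined roughly as follows. This is only a sketch under assumptions: `mtcars` stands in for the real data, the missing values in `hp` are injected purely for illustration, and `fit_one`, `response`, `predictors` and `results` are hypothetical names, not part of any package.

```r
library(data.table)

dt <- as.data.table(mtcars)
dt[c(3, 7), hp := NA_real_]   # assumption: fake missing values, for illustration only

response   <- "mpg"
predictors <- setdiff(names(dt), response)

# Fit mpg ~ <one predictor> and return its coefficient table in long form.
fit_one <- function(v) {
  # reformulate() builds the formula from strings; na.omit drops incomplete
  # rows separately for each model (this is also lm()'s usual default).
  fit <- lm(reformulate(v, response = response), data = dt, na.action = na.omit)
  cf  <- coef(summary(fit))
  data.table(
    variable  = v,
    term      = rownames(cf),   # a factor predictor yields one row per level
    estimate  = cf[, "Estimate"],
    std_error = cf[, "Std. Error"],
    t_value   = cf[, "t value"],
    p_value   = cf[, "Pr(>|t|)"]
  )
}

# tryCatch() turns a failing fit into NULL, which rbindlist() skips.
results <- rbindlist(lapply(predictors, function(v) {
  tryCatch(fit_one(v), error = function(e) NULL)
}))
```

Fitting each model on the original wide table avoids the `melt()` step, so columns of different types never have to share a single `value` column. The caveat about p-values from repeated single-variable fits still applies to the resulting table.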
