Perform linear regression over all columns of data frame where first column is predictor

Question

I have searched StackOverflow for answers to this question but am still struggling - apologies if this looks too much like a duplicate question.

I have a dataframe similar to this:

df <- data.frame(Cohort = c('con', 'con', 'dis', 'dis', 'con', 'dis'),
                 Sex = c('M', 'F', 'M', 'F', 'M', 'M'),
                 P1 = c(50, 40, 70, 80, 45, 75),
                 P2 = c(10, 9, 15, 13, 10, 8))

I want to perform a linear regression on all numeric columns of my dataframe using "Cohort" as the predictor (with the intent of adding features, such as "Sex", in future analysis).

I subset my dataframe to drop all irrelevant columns (in this toy example, Sex):

new_df <- df[, setdiff(names(df), "Sex")]

Then I perform the regression like this:

fit <- lapply(new_df[-1], function(y){summary(lm(y ~ Cohort, data=new_df))})

When I test this on a small subset of my df (~5 columns) it works fine. In reality my df is ~7300 columns. When I run the command on the full dataframe I get this error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'

I then assumed it was an issue with NA values, but the following returns 0:

sum(is.na(new_df))

I have also tried na.action = na.omit, but that did not resolve the error either.

My end goal is to perform these regressions and extract the p-value and r-squared values using anova(fit)$'Pr(>F)' and summary(fit)$r.squared, respectively.

How can I correct this error, or is there a better method to do this? Additionally, moving forward, how can I do this without subsetting my dataframe when I add other features to the regression?

EDIT:

@Parfait A dput() example of my df:

dput(new_data[1:4, 1:4])

structure(list(Cohort = c("Disease", "Disease", "Control", "Control"), 
    seq.10010.10 = c(8.33449676839042, 8.39959836912012, 8.34385193344212, 
    8.43546191447928), seq.10011.65 = c(11.5222872738433, 11.7652860987237, 
    11.1661630826461, 11.008848763327), seq.10012.5 = c(10.5414838640543, 
    10.6862378767518, 10.5408061105915, 10.726558779105)), class = c("soma_adat", 
"data.frame"), row.names = c("258633854330_1", "258633854330_3", 
"258633854330_5", "258633854330_6"))

You should also check for `Inf` values, as the error suggests. You've only checked for `NA` values so far (these aren't the same thing, e.g. `sum(is.na(Inf))` returns 0). – nrennie, May 28 '23 at 23:16
@nrennie I just tested this using ```is.nan``` and ```is.infinite``` and get 0 for both, so it doesn't appear to be an issue with these 'null' type values. – bhumm, May 29 '23 at 00:12
@Parfait specifically, ```sapply(new_df, function(x) sum(is.infinite(x))) %>% sum()``` and ```sapply(new_df, function(x) sum(is.nan(x))) %>% sum()``` – bhumm, May 29 '23 at 00:38
Are you really just running that formula? No log transformation? Are all ~7,300 columns integer or numeric? – Parfait, May 29 '23 at 00:48
@Parfait I do perform log10 transformation earlier in my workflow. The vast majority of the ~7000 columns are numeric, so admittedly my toy example maybe isn't as representative as I originally thought. The other ~40 columns contain numeric, integer, and string data mostly describing the data (i.e., sample IDs, etc.). I drop those columns prior to running ```lm``` and the ```is.nan``` functions. – bhumm, May 29 '23 at 00:53
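
The checks discussed in these comments can be folded into a single pass with `is.finite()`, which is `FALSE` for `NA`, `NaN`, `Inf`, and `-Inf` alike. The sketch below is illustrative only: `toy_df` is a made-up stand-in for `new_df`, and the bad values are planted on purpose. It also lists non-numeric columns, since those cannot serve as a regression response without coercion.

```r
# Hypothetical stand-in for new_df, with problem values planted on purpose
toy_df <- data.frame(
  Cohort = c("con", "con", "dis", "dis"),
  P1 = c(50, 40, Inf, 80),
  P2 = c(10, NA, 15, 13),
  P3 = c(1, 2, 3, 4)
)

# Restrict the check to numeric columns; is.finite() is FALSE for character values
num_cols <- sapply(toy_df, is.numeric)

# Count values per numeric column that are NA, NaN, Inf, or -Inf
bad_counts <- sapply(toy_df[num_cols], function(x) sum(!is.finite(x)))

# Numeric columns containing at least one non-finite value
bad_cols <- names(bad_counts)[bad_counts > 0]

# Columns that are not numeric at all
non_numeric <- names(toy_df)[!num_cols]

print(bad_cols)
print(non_numeric)
```

On a wide data frame, an empty `bad_cols` together with a `non_numeric` that contains only the intended predictors would suggest the values themselves are not the source of the error.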

Parfait · Accepted Answer · 2023-05-30T15:43:16.520

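Consider passing the column name into the function and building the formula dynamically with `reformulate`. You can also wrap the model call in `tryCatch` so that every column is processed and the columns that raise errors are captured. The code below returns a list of data frames holding the statistics extracted from each model, then separates the failures from the successes.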

fit_df_list <- sapply(
    colnames(new_df)[-1],
    function(col) {
        tryCatch({
          fml <- reformulate("Cohort", col)
          fit <- lm(fml, data = new_df)
          results <- summary(fit)

          data.frame(
              variable = col,
              r_squared = results$r.squared,
              f_stat = results$fstatistic["value"],
              f_pvalue = anova(fit)$'Pr(>F)'[1]
          )
        }, error = \(e) paste("Error on", col, ":", e)
        )
    },
    simplify = FALSE
)

# FILTER FOR PROBLEMATIC COLUMNS
fit_err_list <- Filter(is.character, fit_df_list)

# BUILD SINGLE, MASTER DATA FRAME
fit_df <- do.call(rbind, Filter(is.data.frame, fit_df_list))
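
For the follow-up question about avoiding the subsetting step, one option is to pick the response columns by type rather than by position. The sketch below is a variation on the code above and not part of the original answer: it uses the toy `df` from the question, and `response_cols` and `fit_one` are illustrative names. It assumes every numeric column is meant to be a response.

```r
# Toy data from the question; Sex stays in the data frame and is simply not used
df <- data.frame(Cohort = c("con", "con", "dis", "dis", "con", "dis"),
                 Sex = c("M", "F", "M", "F", "M", "M"),
                 P1 = c(50, 40, 70, 80, 45, 75),
                 P2 = c(10, 9, 15, 13, 10, 8))

# Treat every numeric column as a response (an assumption of this sketch)
response_cols <- names(df)[sapply(df, is.numeric)]

# Fit response ~ Cohort for one column and return its summary statistics
fit_one <- function(col) {
  fit <- lm(reformulate("Cohort", response = col), data = df)
  res <- summary(fit)
  data.frame(
    variable = col,
    r_squared = res$r.squared,
    f_pvalue = anova(fit)[["Pr(>F)"]][1]
  )
}

# One row per response column
fit_df <- do.call(rbind, lapply(response_cols, fit_one))
print(fit_df)
```

Naming the `response` argument makes the direction of the formula explicit, which guards against accidentally placing the character column `Cohort` on the left-hand side.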

edited May 30 '23 at 15:43

answered May 29 '23 at 00:42

Parfait


So I just tried this code and got the same error noted above as well as ```Warning: NAs introduced by coercion```. – bhumm May 29 '23 at 01:16
That is not an error but a warning. Does the `fit_df_list` get generated? – Parfait May 29 '23 at 04:55
Yes but I get the same error: ```$P1 [1] "Error on seq.10010.10 : Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'\n"``` for each column in my df. – bhumm May 29 '23 at 14:03
Can you `dput` a small sample of your `new_df` data? See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1422451) Also, adjust post to exact `lm` model you are attempting to run. – Parfait May 29 '23 at 14:27
I have added a portion of the output of `dput()` in my original post. I don't understand the second part of your comment of the exact model, can you provide clarification? – bhumm May 30 '23 at 14:26
Whoops this solution had the character column, `Cohort`, as dependent variable. Simply reverse order in `reformulate`. See [edit](https://stackoverflow.com/posts/76354106/revisions). – Parfait May 30 '23 at 15:38
This works perfectly now - thank you! Final question: if I want to add features to this model, for example 'sex', do I do something like ```fml <- reformulate(c("Cohort", "Sex"), col)```? Thank you again for the help! – bhumm May 30 '23 at 23:58
Yes, you can. Learn more with the docs `?reformulate`. You can even use interacting terms `"Cohort*Sex"` for `termlabels` argument. – Parfait May 31 '23 at 00:40
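
As a rough illustration of the two forms mentioned here, using the toy `df` from the question (the names `fml_add`, `fml_int`, and `fit_add` are made up for this sketch):

```r
# Toy data from the question
df <- data.frame(Cohort = c("con", "con", "dis", "dis", "con", "dis"),
                 Sex = c("M", "F", "M", "F", "M", "M"),
                 P1 = c(50, 40, 70, 80, 45, 75),
                 P2 = c(10, 9, 15, 13, 10, 8))

# Additive model: several term labels are joined with "+"
fml_add <- reformulate(c("Cohort", "Sex"), response = "P1")
print(deparse(fml_add))  # "P1 ~ Cohort + Sex"

# Interaction model: a single term label containing "*"
fml_int <- reformulate("Cohort*Sex", response = "P1")
print(deparse(fml_int))  # "P1 ~ Cohort * Sex"

# Fit the additive model and inspect the coefficient table
fit_add <- lm(fml_add, data = df)
coef_table <- summary(fit_add)$coefficients
print(coef_table)
```

Inside the loop from the answer, `"P1"` would be replaced by `col`, so the same predictors are applied to every response column.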
