Run ANOVA on every column of a dataframe (cols representing mRNA/proteins)

Question

In a data frame of features (eg proteins or mRNA transcripts), I want to calculate a one-way ANOVA for every column (representing expression values). Each row represents a sample (id) that belongs to a group (5 groups in example data).

Searching the web and Stackoverflow only gave me non-functional and partial answers, the expected result would be a table with the ANOVA p-value for every feature (eg a df with first col = feature, second col = ANOVA p-value). Below the example data I post a non-working solution.

For pairwise differential expression analysis between the groups, I am using LIMMA which delivers multiple-comparison corrected p-values. If I understand correctly, ANOVA does not care for mul.comp.cor but only shows for which features there is a difference between any of the 5 groups (and a post-test, eg LIMMA, shows me in which of the groups there is the difference). It would be great if you could confirm that this way of assessing differential expression between my 5 groups is correct.

The solution was taken from here: One-way ANOVA for loop: how do I iterate through multiple columns of a dataframe?

example data

# gene df and list 
genes <- paste("gene",1:1000,sep="")
x <- list(
  A = sample(genes,300), 
  B = sample(genes,525), 
  C = sample(genes,440),
  D = sample(genes,350)
)

# expression dataframe
crete_exp_df <- function(gene_nr, sample_nr){
  df <- replicate(sample_nr, rnorm(gene_nr))
  rownames(df) <- paste("Gene", c(1:nrow(df)))
  colnames(df) <- paste("Sample", c(1:ncol(df)))
  return(df)
}

df1 <- crete_exp_df(50, 20)
df1 <- as.data.frame(df1)
df1$fid <- rownames(df1)

# creator for ANOVA
df4ANOVA <- df1 %>% pivot_longer(-fid) %>%
  pivot_wider(names_from="fid", values_from="value") %>%
  rename(id=name)

df4ANOVA$group <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5)

solution (not working)

library(tidyverse)
library(broom)

df4ANOVA$group <- as.factor(df4ANOVA$group )
df4ANOVA$id <- NULL

# ANOVA
baseformula <- " ~ group"

ANOVA_group <- for (i in 2:ncol(exp4anov)) {
 formula <- paste(colnames(exp4anov)[i], baseformula, sep="")

 p <- summary(aov(as.formula(formula), data=exp4anov))[[1]][["Pr(>F)"]][1]

 print(paste(formula, ": p=", p, sep=""))

}

score 2 · Answer 1 · answered Apr 24 '23 at 13:08

2

A possibility is the ffmanova package:

library(ffmanova)
res <- ffmanova(as.matrix((df4ANOVA[1:50]))~ df4ANOVA$group)

You will find the classical one-way p-values as the second row of res$pRaw. That is, res$pRaw[2, ].

By adding, for example, nSim = 9999, to the ffmanova call, adjusted alternatives to pRaw are also available (res$pAdjusted and res$pAdjFDR).

answered Apr 24 '23 at 13:08

Øyvind Langsrud

181
6

Very cool package, thanks a lot! I just ired it with my data but strangely enough, I get a very low number of genes with FDR <0.05 (5 out of 7543). From the differential expression, I know that around 1000 genes change per comparison between the groups, so the result is a bit unexpected. I used the suggested nSim = 9999. Can you think of a reason why the FDR-controlled transcripts are so low? – Sebastian Hesse Apr 24 '23 at 17:37
For adjusted p I get only 2 genes <= 0.05, for FDR 5 (they include the 2 from adjusted p. without correction I get 2351. Would be cool if you have a suggestion on what might be wrong with my approach. Maybe the sensitivity of LIMMA is simply much greater to pick up the difference between two of the groups than an ANOVA between all 5? – Sebastian Hesse Apr 24 '23 at 17:59
I'm using vsn (variant stabilizing normalization) counts (they are similar to log2 counts) – Sebastian Hesse Apr 24 '23 at 18:00
Found the solution. After reading your paper on campylobacter, it seems that ffmanova is sensitive to missing data! I reduced the dataset to complete.cases and now get 6458 genes with a FDR < 0.05. Thanks for suggesting your package! – Sebastian Hesse Apr 24 '23 at 18:22
The result of ffmanova is very close to the one from LIMMA (just found out that I can get an ANOVA from LIMMA as well!). Does your package allow some kind of evaluation of the foldChange? Normally, I include a logFC of >1 into the significnace criteria. The F value seems to be useful (calculating empirical vs critical F) but it doesn't seem to be easily applicable to a test of thousands of genes. Thanks a lot for your suggestions! – Sebastian Hesse Apr 24 '23 at 19:10
1

Nice that you like `ffmanova`. The package is written for very general models and one-way anova is only a special case. The F statistic (or t) is available as `res$stat`. More can be retrieved with parameters `returnYhat` and `newdata`. But exactly what you want for your application is probably not there. `nSim = 9999` was just a quick suggestion. The higher the number the better, but it can take time. Another possibility is `nSim = c(0, 99999)`. This avoids the calculation of an uninteresting intercept. – Øyvind Langsrud Apr 25 '23 at 09:28

score 1 · Accepted Answer · answered Apr 24 '23 at 12:57

You can use lapply to loop over your df4ANOVA data frame, using grep to only identify columns with Gene:

aov_list <- lapply(df4ANOVA[grep("Gene", colnames(df4ANOVA))], 
                   function(x) aov(df4ANOVA$group ~ x))

This results in a list of length 50, each containing the full results of aov for the column.

If you want a final data frame with p values and gene names, you could first tweak the lapply statement to just extract p values, then do.call to put them together:

aov_plist <- lapply(df4ANOVA[grep("Gene", colnames(df4ANOVA))], 
                   function(x) summary(aov(df4ANOVA$group ~ x))[[1]][5][1,])

finaldat <- data.frame(pval = do.call(rbind, aov_plist))
finaldat$gene <- rownames(finaldat)

Output:

#                pval    gene
# Gene 1  0.551729974  Gene 1
# Gene 2  0.725349369  Gene 2
# Gene 3  0.983277774  Gene 3
# Gene 4  0.811234760  Gene 4
# Gene 5  0.765013222  Gene 5
# Gene 6  0.144887277  Gene 6
# ...

A "cleaner" way to do this may be to use dplyr::bind_rows and tibble::rownames_to_column:

data.frame(t(bind_rows(aov_plist))) %>% 
  tibble::rownames_to_column("Gene")

Both get you functionally the same resulting data frame.

I actually decided to accept yours, because you showed the way how to apply the function to the df, right as I asked it, so fair is fair. But the package of Oyvand will be used in my analysis :) — Sebastian Hesse, Apr 24 '23 at 16:13

Run ANOVA on every column of a dataframe (cols representing mRNA/proteins)

example data

solution (not working)

2 Answers2