2

I am calculating, hopefully right, the hypergeometric test per row in a data frame in R.

Where column 1 is names of genes (microRNAs), the column "Total_mRNAs" is how many mRNA exist in total in the genome so that doesn't change. Column "Total_targets_targets" is how many mRNAs each microRNA can target if all the mRNAs are present. However, for this example only "subset_mRNAs" are present (that number is also always the same) and among these I know how many mRNAs each microRNA can target "subset_targets".

In order to determine if targets for each microRNA are enriched compared to the background (total mRNAs and total microRNAs targeting them) I am performing a hypergeometric test per row like this:

phyper(targets-in-subset, targets-in-bkgd, failure-in-bkgd, sample-size-subset, lower.tail= FALSE)



dput(df1)
structure(list(Genes_names = c("microRNA-1", "microRNA-2", "microRNA-3", 
"microRNA-4", "microRNA-5", "microRNA-6", "microRNA-7", "microRNA-8", 
"microRNA-9", "microRNA-10"), Total_mRNAs = c(61064L, 61064L, 
61064L, 61064L, 61064L, 61064L, 61064L, 61064L, 61064L, 61064L
), Total_targets_targets = c(1918L, 7807L, 3969L, 771L, 2850L, 
1355L, 1560L, 2478L, 1560L, 2478L), subset_mRNAs = c(17571L, 
17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 
17571L), subset_targets = c(544L, 2109L, 1137L, 213L, 793L, 394L, 
430L, 686L, 430L, 686L)), class = "data.frame", row.names = c(NA, 
-10L))


df1$pvalue <- phyper(df1$subset_targets, df1$Total_targets_targets, df1$Total_mRNAs-df1$Total_targets_targets, df1$subset_mRNAs, lower.tail= FALSE)

Now the question is how can I Bonferroni correct this values? Is this calculation theoretically right?

  • You can use the ```apply``` function for this. Please give more details on what you want. From your code listenings I do not know how you want to calculate what exactly. – MacOS Nov 23 '20 at 14:04
  • Sorry I corrected the question lots. But ill try and explain it a bit better. – Amaranta_Remedios Nov 23 '20 at 14:05
  • I don't understand where this ```df1$Total-df1$targets``` comes from. – MacOS Nov 23 '20 at 14:08
  • df1$Total = all the mRNAs present in the genome of that specie 61064 df1$targets = all the targets each microRNA can theoretically have if all the mRNAs are present 1918 for the first one. – Amaranta_Remedios Nov 23 '20 at 14:17

2 Answers2

3

If you have not a lot of samples, avoid this pain and simply use a fisher test and do bonferroni using p.adjust:

library(broom)
result = lapply(1:nrow(df1),function(i){
       not_target_subset = df1$Total_targets_targets[i] - df1$subset_targets[i]
       not_subset = df1$Total_mRNAs[i] - df1$subset_mRNAs[i] - not_target_subset
       
       
       M = cbind(c(df1$subset_targets[i],df1$subset_mRNAs[i]-df1$subset_targets[i]),
             c(not_target_subset,not_subset))
      
       res = data.frame(Genes_names=df1$Genes_names[i],
                tidy(fisher.test(M,alternative="greater")))

       return(res)
})

result= do.call(rbind,result)
result$padj = p.adjust(result$p.value,"bonferroni")

Your hyper-geometric code is slightly off. And note that you are doing one sided hyper-geometric test.

You can check this post for how to put the tables into phyper and this for why you need the -1 . So we calculate the hypergeometric p-value:

result$hyper_p = with(df1, 
phyper(subset_targets-1,subset_mRNAs,Total_mRNAs-subset_mRNAs, Total_targets_targets, lower.tail= FALSE)
)

And you can see it tallies:

   Genes_names  estimate   p.value  conf.low conf.high
1   microRNA-1 0.9793710 0.6655527 0.8984025       Inf
2   microRNA-2 0.9047305 0.9998968 0.8647701       Inf
3   microRNA-3 0.9933480 0.5791759 0.9350214       Inf
4   microRNA-4 0.9441864 0.7722712 0.8229140       Inf
5   microRNA-5 0.9520878 0.8789562 0.8863785       Inf
6   microRNA-6 1.0151760 0.4119600 0.9168998       Inf
7   microRNA-7 0.9404619 0.8641420 0.8539585       Inf
8   microRNA-8 0.9454359 0.8942082 0.8756678       Inf
9   microRNA-9 0.9404619 0.8641420 0.8539585       Inf
10 microRNA-10 0.9454359 0.8942082 0.8756678       Inf
                               method alternative   hyper_p
1  Fisher's Exact Test for Count Data     greater 0.6655527
2  Fisher's Exact Test for Count Data     greater 0.9998968
3  Fisher's Exact Test for Count Data     greater 0.5791759
4  Fisher's Exact Test for Count Data     greater 0.7722712
5  Fisher's Exact Test for Count Data     greater 0.8789562
6  Fisher's Exact Test for Count Data     greater 0.4119600
7  Fisher's Exact Test for Count Data     greater 0.8641420
8  Fisher's Exact Test for Count Data     greater 0.8942082
9  Fisher's Exact Test for Count Data     greater 0.8641420
10 Fisher's Exact Test for Count Data     greater 0.8942082
StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • I would mark this as the correct answer. @StupidWolf ' answer is qualitatively better than mine. The user does a way better job at explaining the qualitative part of the question, while I only focus on the quantiative part (i.e. code). – MacOS Nov 23 '20 at 15:29
  • Thanks to both of you!! really appreciate the answers and the explanations because I am having problems understanding the different test. I didn't have lots of samples because I was using that as a test my real data is bigger (350 samples) – Amaranta_Remedios Nov 23 '20 at 15:54
2

Warning: The user that asked that question pointed out that the calculations in this answer might be wrong. Please see the comments below.

Based on your edit, it seems that you are looking for is

df1$new.column <- apply(df1,
                        margin = 1,
                        function(row),
                        {
                             return(phyper(row$targets.1, row$targets, sum(row$targets.1, row$targets), row$subset, lower.tail= FALSE))
                        }

EDIT

As pointed out in a comment by StupidWolf, phyper is vectorized. So, you can use (copied from the comment)

with(df1, phyper(targets.1, targets, sum(targets.1, targets), subset, lower.tail= FALSE)

HTH!

MacOS
  • 1,149
  • 1
  • 7
  • 14
  • Thanks, I changed lots during the edits. But that's exactly what I was looking for originally. Thanks!! – Amaranta_Remedios Nov 23 '20 at 14:19
  • you don't need to use the apply function. phyper is vectorized. also you need to a bonferroni correction as in the OP's question? – StupidWolf Nov 23 '20 at 14:21
  • 1
    if you do ```with(df1,phyper(targets.1, targets, sum(targets.1, targets), subset, lower.tail= FALSE)``` you get back the same result – StupidWolf Nov 23 '20 at 14:22
  • 1
    Oh, I did not now that ```phyper``` is vectorized. I will change my answer. – MacOS Nov 23 '20 at 14:24
  • 1
    Thanks, I don't understand why this part is a sum [sum(targets.1, targets)] shouldn't that be a subtraction of total mRNAs- targeted mRNAs? – Amaranta_Remedios Nov 23 '20 at 14:38
  • @Amaranta_Remedios: This is a qualitative question, not a quantitative question. Therefore, I'm not qualified to answer this question. :) – MacOS Nov 23 '20 at 14:39
  • Please update your question such that it is 100% clear what you want to do. You could, for example, pick on row and write done what you want to calculate. – MacOS Nov 23 '20 at 14:40
  • Thanks again to both of you! You and the entire community deserve acknowledgement in my thesis :) – Amaranta_Remedios Nov 23 '20 at 17:24