Big dataframe: "repeated" t-test between groups for thousands of factors

Question

I have read a lot of posts related to data wrangling and “repeated” t-test but I can’t figure out the way to achieve it in my case.

You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0

I have a big dataframe of gen expression like:

> b<-read.delim("dataset.example.stckovflw.txt")

> head(b)
      animal            gen condition tissue    LogFC
1 animalcontrol1         kjhss1   control  brain 7.129283
2 animalcontrol1          sdth2   control  brain 7.179909
3 animalcontrol1     sgdhstjh20   control  brain 9.353147
4 animalcontrol1 jdygfjgdkydg21   control  brain 6.459432
5 animalcontrol1  shfjdfyjydg22   control  brain 9.372865
6 animalcontrol1      jdyjkdg23   control  brain 9.541097

> str(b)
'data.frame':   21507 obs. of  5 variables:
 $ animal   : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ gen      : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787    360 863 385 133 888 563 738 ...
 $ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ tissue   : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
 $ LogFC    : num  7.13 7.18 9.35 6.46 9.37 ...

Each group has 5 animals, and each animal has many gens quantified. (Each animal may have a different set of quantified gens, but many gens are shared between animals and groups.)

I would like to perform a t-test for each gen between each treated group (A, B, C or D) and the controls. The results should be presented as a table containing the p-value for each gen in each group.

Because I have so many gens (thousands), I cannot subset each gen manually.

Do you know how I could automate the procedure?

I was thinking about a loop but I am absolutely not sure it could achieve what I want and how to proceed.

Also, I was looking at these posts that use the apply function: "Apply t-test on many columns in a dataframe split by factor" and "Looping through t.tests for data frame subsets in r".

# ################ Additional information after reading the first comments and answers:

@andrew_reece : Thank you very much for this. It is almost exactly what I was looking for. However, I can’t find a way to do it with a t-test. ANOVA gives interesting information, but I will then need to know which of the treated groups is/are significantly different from my controls. I would also need to know which treated groups are significantly different from each other, “two by two”.

I have been trying to use your code by changing “aov(…)” to “t.test(…)”. To do that, I first take a subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when I export the result table to a csv file, the table is not understandable (no gen names, no p-values, etc., only numbers). I will keep searching for a way to do it properly, but for now I’m stuck.

@42:

Thank you very much for these tips. This is just a dataset example, let’s assume we do have to use individual t-tests.

This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write the code, but it is somewhat outside the initial topic. Also, I don't know a less tedious way to summarize the shared "genes" detected in each combination of conditions, so I have simplified to only 3 conditions.

# Visualisation of shared genes by VennDiagrams :
# let's simplify and consider only 3 conditions :

b<-read.delim("dataset.example.stckovflw.txt")
b<- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")

b1<-table(b$gen, b$condition)

b1

b2<-as.data.frame(b1) # long format: Var1 = gen, Var2 = condition, Freq = number of rows

b3<-subset(b2, Freq>2) # keep only genes quantified in more than 2 animals per group (Freq > 2)
b3
b4 = within(b3, {
  Freq = ifelse(Freq > 1, 1, 0)
}) # for those observations, we consider the gene has been detected, so we set the value to 1 regardless of the frequency of occurrence (>2)


b4

b5<-table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")

# make an intermediate .txt file (just manually add the title of the first column)
# so now we have info
bb5<-read.delim("bb5.txt")

nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))

nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))

nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))

library(grid)
library(futile.logger)
library(VennDiagram)

venn.plot <- draw.triple.venn(area1 = 1005, 
                          area2 = 927, 
                          area3 = 943, 
                          n12 = 843, 
                          n23 = 861, 
                          n13 = 866, 
                          n123 = 794, 
                          category = c("controls", "treatmentA", "treatmentB"),
                          fill = c("red", "yellow", "blue"),
                          cex = 2,
                          cat.cex = 2,
                          lwd = 6,
                          lty = 'dashed',
                          fontface = "bold",
                          fontfamily = "sans",
                          cat.fontface = "bold",
                          cat.default.pos = "outer",
                          cat.pos = c(-27, 27, 135),
                          cat.dist = c(0.055, 0.055, 0.085),
                          cat.fontfamily = "sans",
                          rotation = 1);
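
The counts typed by hand into draw.triple.venn() above could probably be computed directly from the data, without the csv export and the manually edited intermediate file. A minimal sketch on toy data, assuming the same gen and condition columns as in the question; toy, min_n and n_shared() are names made up for this illustration:

```r
# Toy stand-in for b (assumption: one row per animal x gen, columns gen and condition)
toy <- data.frame(
  gen = c("g1", "g1", "g1", "g2", "g2", "g3", "g3", "g3", "g3"),
  condition = c("control", "treatmentA", "treatmentB",
                "control", "treatmentA",
                "treatmentB", "control", "control", "treatmentB"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold: minimum number of rows per gen and condition
# for the gen to count as "detected" in that condition
min_n <- 1

# Logical gen x condition matrix: TRUE where the gen is detected
detected <- unclass(table(toy$gen, toy$condition)) >= min_n

# Number of gens detected in every one of the given conditions
n_shared <- function(conds) {
  sum(rowSums(detected[, conds, drop = FALSE]) == length(conds))
}

# Named like the arguments of draw.triple.venn()
venn_counts <- c(
  area1 = n_shared("control"),
  area2 = n_shared("treatmentA"),
  area3 = n_shared("treatmentB"),
  n12   = n_shared(c("control", "treatmentA")),
  n23   = n_shared(c("treatmentA", "treatmentB")),
  n13   = n_shared(c("control", "treatmentB")),
  n123  = n_shared(c("control", "treatmentA", "treatmentB"))
)
venn_counts
```

With the threshold used in the question (Freq > 2), min_n would be 3. The resulting values could then be passed to draw.triple.venn() instead of hard-coded numbers.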

could you give an example of one t-test you'd like to perform? I'm not sure I fully understand the question — shuckle, May 12 '18 at 23:31
I'm assuming your use of the terms `gen` and `gens` is really asking for consideration of these values as "gene" and "genes" as in "genetic". — IRTFM, May 12 '18 at 23:34
Not an answer to your problem, but I don't think a t-test is the most appropriate statistical test to use here, something you can ask over at stats.stackexchange.com. — hpesoj626, May 12 '18 at 23:43
@shuckle : b<-read.delim("dataset.example.stckovflw.txt") > b1<- subset(b, condition == "control" | condition == "treatmentA" ) > b2<- subset(b1, gen == "kjhss1") > t.test(LogFC ~ condition, + paired=FALSE, + data=b1) Welch Two Sample t-test data: LogFC by condition t = 0.23853, df = 9235, p-value = 0.8115 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.3407286 0.4351405 sample estimates: mean in group control mean in group treatmentA 9.966224 9.919019 — SkyR, May 13 '18 at 13:48
@42 : yes indeed, it is actually "gene" instead of "gen" (I got confused because in Spanish it is "gen"...) — SkyR, May 13 '18 at 13:48
@hpesoj626 : I am just interested in the R technique here, to apply it to another biological endpoint; this is just a fake dataset for the example, in order to familiarize myself with the R techniques. I agree that t-tests are usually not recommended in genomics… sorry for this bad example. Let’s just assume we have to use t-tests and that it is another biological endpoint. — SkyR, May 13 '18 at 13:49
@SkyR, note that `@` references in your actual post don't get flagged in user accounts. I wouldn't have seen your updated request if I hadn't been checking. Better to post answer-specific responses as comments for a given answer. — andrew_reece, May 13 '18 at 16:08

andrew_reece · Answer 1 · 2018-05-13T16:07:03.027

Update (per OP comments):
Pairwise comparison across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others, this is just one way to demonstrate PoC.)

require(tidyverse)

results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef) %>% 
  select(-term)

results
# A tibble: 7,118 x 6
   gen        comparison            estimate conf.low conf.high adj.p.value
   <chr>      <chr>                    <dbl>    <dbl>     <dbl>       <dbl>
 1 kjhss1     treatmentA-control       1.58     -20.3      23.5       0.997
 2 kjhss1     treatmentC-control      -3.71     -25.6      18.2       0.962
 3 kjhss1     treatmentD-control       0.240    -21.7      22.2       1.000
 4 kjhss1     treatmentC-treatmentA   -5.29     -27.2      16.6       0.899
 5 kjhss1     treatmentD-treatmentA   -1.34     -23.3      20.6       0.998
 6 kjhss1     treatmentD-treatmentC    3.95     -18.0      25.9       0.954
 7 sdth2      treatmentC-control      -1.02     -21.7      19.7       0.991
 8 sdth2      treatmentD-control       3.25     -17.5      24.0       0.909
 9 sdth2      treatmentD-treatmentC    4.27     -16.5      25.0       0.849
10 sgdhstjh20 treatmentC-control      -7.48     -30.4      15.5       0.669
# ... with 7,108 more rows
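
If t-tests specifically are wanted for the "two by two" comparisons, one possible alternative is stats::pairwise.t.test(), which runs all pairwise t-tests within a gen and adjusts the p-values for the number of comparisons. This is only a sketch on simulated data, not the approach used above; toy, pairwise_by_gen and results_pw are names invented for the illustration, and the choice of Holm adjustment with unpooled (Welch) tests is an assumption:

```r
# Simulated stand-in for b (assumption: columns gen, condition, LogFC)
set.seed(42)
toy <- data.frame(
  gen       = rep(c("g1", "g2"), each = 15),
  condition = rep(rep(c("control", "treatmentA", "treatmentB"), each = 5),
                  times = 2),
  LogFC     = rnorm(30, mean = 8)
)

# One pairwise.t.test() per gen, reshaped to one row per comparison
pairwise_by_gen <- lapply(split(toy, toy$gen), function(d) {
  pw <- pairwise.t.test(d$LogFC, d$condition,
                        p.adjust.method = "holm",  # adjusted within each gen
                        pool.sd = FALSE)           # separate Welch tests
  # pw$p.value is a lower-triangular matrix; flatten it and drop the empty cells
  out <- as.data.frame(as.table(pw$p.value), stringsAsFactors = FALSE)
  names(out) <- c("group1", "group2", "p.adj")
  out <- out[!is.na(out$p.adj), ]
  cbind(gen = d$gen[1], out)
})

results_pw <- do.call(rbind, pairwise_by_gen)
rownames(results_pw) <- NULL
results_pw
```

On the real data, gens with fewer than two observations in some condition would need to be filtered out first, since a t-test cannot be computed for them. The adjustment here is only across the comparisons within one gen, not across the thousands of gens.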

Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).

But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.

For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.

require(tidyverse)

results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ aov(LogFC ~ condition, data = .x)),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef)

A few cosmetic trimmings to get closer to your original vision (of just a table with gen and p-values), although note that this really leaves a lot of important information out and I'm not advising you actually limit your results in this way.

results <- results %>%
  filter(term!="Residuals") %>%
  select(gen, df, statistic, p.value)

results
# A tibble: 1,111 x 4
   gen               df statistic p.value
   <chr>          <dbl>     <dbl>   <dbl>
 1 kjhss1            3.     0.175   0.912
 2 sdth2             2.     0.165   0.850
 3 sgdhstjh20        2.     0.440   0.654
 4 jdygfjgdkydg21    2.     0.267   0.770
 5 shfjdfyjydg22     2.     0.632   0.548
 6 jdyjkdg23         2.     0.792   0.477
 7 fckjhghw24        2.     0.790   0.478
 8 shsnv25           2.     1.15    0.354
 9 qeifyvj26         2.     0.588   0.573
10 qsiubx27          2.     1.14    0.359
# ... with 1,101 more rows

Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
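
For the table the question literally asks for (one p-value per gen for each treatment against control), a base R version could look roughly like the following. It is a sketch under assumptions: the data are simulated, toy, one_test() and p_table are hypothetical names, Welch tests are used because that is the t.test() default, and the Benjamini-Hochberg adjustment at the end is one possible choice rather than a recommendation:

```r
# Simulated stand-in for b (assumption: columns gen, condition, LogFC)
set.seed(1)
toy <- expand.grid(
  animal_id = 1:5,
  gen       = c("g1", "g2", "g3"),
  condition = c("control", "treatmentA", "treatmentB"),
  stringsAsFactors = FALSE
)
toy$LogFC <- rnorm(nrow(toy), mean = 8)
# Leave g3 with a single treatmentB animal, to mimic incomplete gens
toy <- toy[!(toy$gen == "g3" & toy$condition == "treatmentB" &
             toy$animal_id > 1), ]

# Hypothetical helper: p-value of a Welch t-test, one treatment vs the reference,
# within the rows of a single gen; NA when a group has fewer than 2 values
one_test <- function(d, treatment, reference = "control") {
  x <- d$LogFC[d$condition == treatment]
  y <- d$LogFC[d$condition == reference]
  if (length(x) < 2 || length(y) < 2) return(NA_real_)
  t.test(x, y)$p.value
}

treatments <- setdiff(unique(toy$condition), "control")
by_gen <- split(toy, toy$gen)

# Matrix of raw p-values: one row per gen, one column per treatment
p_table <- sapply(treatments, function(tr) {
  sapply(by_gen, one_test, treatment = tr)
})

# Multiplicity adjustment over the whole table, keeping the matrix shape
p_adj <- p_table
p_adj[] <- p.adjust(p_table, method = "BH")

p_table
p_adj
```

Either matrix can be written out with write.csv(), which keeps the gen names as row names. The statistical caveats raised in the comments still apply to this approach.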

1,111 happens to be the number of rows in results before the second filter where the df was > 4. I would think the most interest at least to start would be in those where the df's were much higher, perhaps 15 or greater. — IRTFM, May 13 '18 at 00:23
Except trying to access those in the tibble brings up only NA's. Color me puzzled. `results[ results$df == 20 , ]` # A tibble: 0 x 4 — IRTFM, May 13 '18 at 00:30
There are only 5 groups - control and A-D - so there can't be more than 4 degrees of freedom (ie `max(results$df) == 4`). What's the focus on 20? — andrew_reece, May 13 '18 at 00:42
If you do `table( results$df )` you get a much larger range of results. I'm wondering which of the F-statistic degrees of freedom (group or residuals) was being reported, ... oh never mind, I used the first set of `results`. — IRTFM, May 13 '18 at 02:22
Thank you so much andrew_reece for your updated answer. This is really useful to me. To finalize my initial wish of “the good handling” of commands, I was trying to use your code with a non-parametric test. With kruskal.test it works fine, but when implementing a posthoc test to get pairwise comparisons, it is not working: … model = map(data, ~ dunnTest(LogFC ~ condition, data = .x, method="bh")) I get the following error message : Error in mutate_impl(.data, dots) : Evaluation error: incorrect number of dimensions. — SkyR, May 13 '18 at 19:24
I know I am stubborn, but I am still stuck with using t.test in your code to get a table overview of pairwise comparisons. When i use t.test in your example code (instead of aov), i get the following error : Error in mutate_impl(.data, dots) : Evaluation error: grouping factor must have exactly 2 levels. If I use b<- subset(b, condition == "control" | condition == "treatmentA"), then I get still an error : Error in FUN(X[[i]], ...) : object 'term' not found… — SkyR, May 13 '18 at 19:34
Try to understand the solution here, rather than just replacing functions with other functions. For example, you can see that `term` isn't even included in the original answer - that's because the outputs of `aov` and `TukeyHSD` are different. All you need to do to discover this is remove the `select` at the end and observe what the function outputs are. As for t-tests, it's not clear why you'd want them if you also want non-parametric tests. — andrew_reece, May 13 '18 at 19:49
This answer solves your problem of how to automate grouped pairwise comparisons. It provides comparison-adjusted p-values (which t-tests don't give you), _and_ the whole point of HSD as a post-hoc test is that it reports a test statistic based on the Studentized range distribution, which approximates the _t_ distribution anyway. If you have questions about when and how to employ parametric vs non-parametric tests, post to CrossValidated. If you have additional questions about how to use other inferential tests from specific packages, open a new question on StackOverflow. — andrew_reece, May 13 '18 at 19:52

IRTFM · Answer 2 · 2018-05-13T00:07:20.060

You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:

table(b$animal, b$condition)

                    control treatmentA treatmentB treatmentC treatmentD
  animalcontrol1       1005          0          0          0          0
  animalcontrol2        857          0          0          0          0
  animalcontrol3        959          0          0          0          0
  animalcontrol4        928          0          0          0          0
  animalcontrol5       1005          0          0          0          0
  animaltreatmentA1       0        927          0          0          0
  animaltreatmentA2       0        883          0          0          0
  animaltreatmentA3       0        908          0          0          0
  animaltreatmentA4       0        861          0          0          0
  animaltreatmentA5       0        927          0          0          0
  animaltreatmentB1       0          0        943          0          0
  animaltreatmentB2       0          0        841          0          0
  animaltreatmentB3       0          0        943          0          0
  animaltreatmentB4       0          0        910          0          0
  animaltreatmentB5       0          0        943          0          0
  animaltreatmentC1       0          0          0        742          0
  animaltreatmentC2       0          0          0        724          0
  animaltreatmentC3       0          0          0        702          0
  animaltreatmentC4       0          0          0        698          0
  animaltreatmentC5       0          0          0        742          0
  animaltreatmentD1       0          0          0          0        844
  animaltreatmentD2       0          0          0          0        776
  animaltreatmentD3       0          0          0          0        812
  animaltreatmentD4       0          0          0          0        783
  animaltreatmentD5       0          0          0          0        844

Agree you need to "automate" this in some fashion, but I think you are in need of a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants. I think you should be discussing this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:

str( b[ b$gen == "dghwg1041", ])
'data.frame':   13 obs. of  5 variables:
 $ animal   : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
 $ gen      : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
 $ tissue   : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
 $ LogFC    : num  4.34 2.98 4.44 3.87 2.65 ...

You do have a fair number with "complete representation":

gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   21507        0       18    0.976    20.32    4.802       13       14 
     .25      .50      .75      .90      .95 
      18       20       24       25       25 

Value          5     8     9    10    12    13    14    15    16    17
Frequency    100    48   288   270    84   624   924  2220    64   527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025

Value         18    19    20    21    22    23    24    25
Frequency    666  2223  3840    42   220  1058  3384  4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229

You might start by looking at all the "gen"s that have complete data:

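gen_tbl <- table(b$gen)  # not defined in the original post; assumed here to be the per-gen row count
head( gen_tbl[ gen_tbl == 25 ], 25)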
#------------------
   dghwg1131     dghwg546     dghwg591     dghwg636     dghwg681 
          25           25           25           25           25 
    dghwg726    dgkuck196    dgkuck286    dgkuck421    dgkuck691 
          25           25           25           25           25 
   dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692 
          25           25           25           25           25 
dgkukdgse737       djh592       djh637       djh682       djh727 
          25           25           25           25           25 
   dkgkjd327    dkgkjd642    dkgkjd687    dkgkjd732  fckjhghw204 
          25           25           25           25           25
