2

I am using R to do some statistics, this question is duplicated from stats exchange where it was closed as it is not really s stats question, so I thought it might be more relevant to stack overflow (https://stats.stackexchange.com/questions/441638/how-do-i-run-only-a-subset-of-comparisons-in-a-t-test-using-r/441674#441674). Although the answer given here (to subset the data and then run the test) seems to be logically correct I can not see a way to do this without repeating 100 different pieces of code for each individual glycan (see below) :

I have generated a data.frame from raw data. The data includes a numeric variable (fold_change) and two factor variables (dis_status which includes RF and con, as well as glycan which includes 100 different glycans)

Here is a reproducible example with only 3 glycans and 3 "RF" and 3 "con" per glycan.

   > example
   dis_status glycan fold_change
1          RF      a  4.83433185
2          RF      a  3.88519084
3          RF      a  2.80368849
4         con      a  0.94730194
5         con      a  1.91278688
6         con      a  1.23225002
7          RF      b  4.07173876
8          RF      b  5.70383491
9          RF      b  0.05282291
10        con      b  1.34631723
11        con      b  4.26723583
12        con      b  4.26723583
13         RF      c  2.20887813
14         RF      c  4.62220094
15         RF      c  0.94730194
16        con      c  0.53597973
17        con      c  2.92572685
18        con      c  1.58871049

> dput(example)
structure(list(dis_status = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("con", 
"RF"), class = "factor"), glycan = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), fold_change = c(4.834331853, 3.885190842, 
2.803688487, 0.947301944, 1.912786879, 1.232250023, 4.071738761, 
5.703834911, 0.052822912, 1.346317234, 4.267235834, 4.267235834, 
2.208878135, 4.622200944, 0.947301944, 0.535979733, 2.925726849, 
1.588710491)), class = "data.frame", row.names = c(NA, -18L))

I can run a t.test on the data:

ad_nonpaired <- pairwise.t.test(stats_df$fold_change,  stats_df$dis_status:stats_df$glycan, 
                               paired = F,
                               pool.sd = F,
                               p.adj = "none")

I will correct for multiple comparisons next but the issue I am having is this carries out t.tests between every possible combination of dis_status and glycan.

I am only interested in the "RF" vs "con" for each individual glycan. So with the three glycans above, I only really want "x" from "RF" compared to "x" from "con" NOT any comparison between "x" to "y" but can not figure out how to specify this in the test?

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.25          broom_0.5.2         ggrepel_0.8.1       readxl_1.3.1        forcats_0.4.0       stringr_1.4.0       dplyr_0.8.3         purrr_0.3.3        
 [9] readr_1.3.1         tidyr_1.0.0         tibble_2.1.3        ggplot2_3.2.1       tidyverse_1.2.1     limma_3.38.3        hexbin_1.27.3       vsn_3.50.0         
[17] Biobase_2.42.0      BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2            lubridate_1.7.4       lattice_0.20-38       gtools_3.8.1          rprojroot_1.3-2       assertthat_0.2.1      zeallot_0.1.0         digest_0.6.22        
 [9] utf8_1.1.4            plyr_1.8.4            R6_2.4.0              cellranger_1.1.0      backports_1.1.5       evaluate_0.14         highr_0.8             httr_1.4.1           
[17] pillar_1.4.2          gplots_3.0.1.1        zlibbioc_1.28.0       rlang_0.4.1           lazyeval_0.2.2        curl_4.2              rstudioapi_0.10       gdata_2.18.0         
[25] preprocessCore_1.44.0 desc_1.2.0            labeling_0.3          splines_3.5.2         munsell_0.5.0         xfun_0.10             compiler_3.5.2        modelr_0.1.5         
[33] pkgconfig_2.0.3       tidyselect_0.2.5      fansi_0.4.0           crayon_1.3.4          withr_2.1.2           bitops_1.0-6          grid_3.5.2            nlme_3.1-141         
[41] jsonlite_1.6          gtable_0.3.0          lifecycle_0.1.0       affy_1.60.0           magrittr_1.5          scales_1.0.0          KernSmooth_2.23-16    cli_1.1.0            
[49] stringi_1.4.3         affyio_1.52.0         testthat_2.2.1        xml2_1.2.2            ellipsis_0.3.0        generics_0.0.2        vctrs_0.2.0           tools_3.5.2          
[57] glue_1.3.1            hms_0.5.2             pkgload_1.0.2         yaml_2.2.0            colorspace_1.4-1      BiocManager_1.30.9    caTools_1.17.1.2      rvest_0.3.4          
[65] haven_2.1.1          
reubenmcg
  • 371
  • 4
  • 18
  • 2
    Can you provide a reproducible example of your data ? It will make things easier for people trying to help you (see: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – dc37 Jan 06 '20 at 21:38

3 Answers3

2

You could split the data frame by glycan then t-test by the dis_status group without any external libraries:

results <- do.call("rbind", lapply(split.data.frame(df, df$glycan), 
                  function(x) {
                    pairwise.t.test(x$fold_change, x$dis_status,
                                    paired = FALSE, pool.sd = FALSE, 
                                    p.adj = "none") -> test;
                    as.numeric(tapply(x$fold_change, x$dis_status, mean)) -> ta;
                    data.frame(glycan = as.character(x$glycan[1]), 
                               mean.con = ta[1],
                               mean.RF = ta[2],
                               pvalue = as.numeric(test$p.value));
                  }))

Which gives the data frame you wanted, according to the comments

results
  glycan mean.con  mean.RF     pvalue
a      a 1.364113 3.841070 0.03403083
b      b 3.293596 3.276132 0.99335164
c      c 1.683472 2.592794 0.52325471
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • 1
    Thanks, this seem to work well. Could you perhaps update the answer to work with the "pairwise.t.test" parameters I used in the question? I will be using the pairwise.t.test and changing the parameters therein, but if I simply replace the code it seems to not work? – reubenmcg Jan 06 '20 at 22:44
  • 1
    Done. It actually made it simpler too. – Allan Cameron Jan 06 '20 at 22:57
  • Works great, any advice on how to extract a data frame with glycan name and p.value from this list? – reubenmcg Jan 06 '20 at 23:16
  • Thank you, slightly off topic but do you happen to know why the P-value does not change if I change the p.adj = "none" to p.adj = "fdr" or any other correction. is this because the data is now parsed in such a way that it is one comparison per glycan whereas in my original code it was multiple comparisons per glycan? As in my original code the correction for multiple comparisons did yield corrected different p-values... – reubenmcg Jan 07 '20 at 00:51
  • 1
    Yes, I think it just doesn't need to do a correction for multiple hypothesis testing because it is a simple 2 group test – Allan Cameron Jan 07 '20 at 00:54
  • I know this post now has an answer, but: If I now have 3 groups in the "dis_status" column. I edit the code to add "mean.test = ta[3]" for the third group. The data frame now given has 4 rows per glycan I assume being the T.test between each of the different groups for that glycan. But they are just numbered "Glycan.1", "Glycan.2" up to 4, with one of the P-values always being "NA". Is there a way of getting the code to tell me which p.value is which comparison, rather than number them 1-4? – reubenmcg Jan 20 '20 at 22:24
1

I'm not sure that it is the perfect way but you can subset your dataframe for keeping only values for a single glycan and then perform the t-test between the each levels of dis_status:

library(tidyverse)
level_glycan = levels(example$glycan)
pvalue = NULL
for(i in level_glycan)
{
  temp <- example %>% filter(glycan == i)
  t <- pairwise.t.test(temp$fold_change, temp$dis_status, paired = FALSE, pool.sd = FALSE, p.adj = "none")
  pval <- as.numeric(t$p.value)
  pvalue <- c(pvalue, pval)
}

And you get pvalue vector containing each p value for the comparison of RF vs Con for each glycan.

> pvalue
[1] 0.03403083 0.99335164 0.52325471

Does it look what you are expecting ?


NB: Here I used the filter function from dplyr package (loaded using the tidyverse package) to subset the dataset but you can get the same subset by doing:

temp <- subset(example, glycan == i)

or

temp <- example[example$glycan == i,]
dc37
  • 15,840
  • 4
  • 15
  • 32
  • Whilst this seems to work on the exmaple, when I replace the "t.test" chunk of code with my "pairwise.t.test" chunk of code it seems not to work? – reubenmcg Jan 06 '20 at 22:35
  • 1
    I edited my answer accordingly to use the `pairwise.t.test` function. Let me if it is working now. – dc37 Jan 06 '20 at 22:39
0
It seems like you might want to use this for your general work-flow.

// Make two generic lists 
compareList <- list()
compareListTwo <- list()

//Filter through each glycan for each instance of the variable name 
for(i in 1:length(stats_df))
{
  compareList[i] = dplyr::filter(stats_df, stats_df$dis_status[i] == "RF")
}

for(j in 1:length(stats_df))
{
  compareListTwo[j] = dplyr::filter(stats_df, stats_df$dis_status[j] == "con")   

}

compareList <- unlist(compareList)
compareListTwo <- unlist(compareListTwo)

data.frame(compareList)
data.frame(compareListTwo)

newdf  <- rbind(compareList,compareListTwo)