paired t-test with pairs and groups defined in another dataframe

Question

I have a dataframe which looks like this:

> head(data)
               LH3003     LH3004     LH3005     LH3006     LH3007     LH3008     LH3009     LH3010     LH3011
cg18478105 0.02329879 0.08103364 0.01611778 0.01691191 0.01886975 0.01885553 0.01647439 0.02120779 0.01168622
cg14361672 0.09479536 0.07821380 0.02522833 0.06467310 0.05387729 0.05866673 0.08121820 0.10920162 0.04413263
cg01763666 0.03625680 0.04633759 0.04401555 0.08371531 0.09866403 0.17611284 0.07306743 0.12422579 0.11125146
cg02115394 0.10014794 0.09274320 0.08743445 0.08906313 0.09934032 0.18164115 0.06526380 0.08158144 0.08862067
cg13417420 0.01811630 0.02221060 0.01314041 0.01964530 0.02367295 0.01209913 0.01612864 0.01306061 0.04421938
cg26724186 0.32776266 0.31386294 0.24167480 0.29036142 0.24751268 0.26894756 0.20927278 0.28070790 0.33188921
               LH3012     LH3013     LH3014
cg18478105 0.02466508 0.01909706 0.02054417
cg14361672 0.09172160 0.06170230 0.07752691
cg01763666 0.04328518 0.13693868 0.04288165
cg02115394 0.08682942 0.08601880 0.12413149
cg13417420 0.01980470 0.02241745 0.02038114
cg26724186 0.30832389 0.27644816 0.37630038

with almost 850000 rows, and a different dataframe which contains the information behind the sample names:

> variables
   Sample_ID  Name Group01
3     LH3003 pair1       0
4     LH3004 pair1       1
5     LH3005 pair2       0
6     LH3006 pair2       1
7     LH3007 pair3       0
8     LH3008 pair3       1
9     LH3009 pair4       0
10    LH3010 pair4       1
11    LH3011 pair5       0
12    LH3012 pair5       1
13    LH3013 pair6       0
14    LH3014 pair6       1

Is it possible to do a paired t-test by defining the pairs and the group annotation of the samples based on another dataframe?

Thank you for your help!

lmo · Answer 1 · 2016-11-15T13:49:31.223

Here is an lapply method that stores the result of each test in a list. It assumes that the two samples of each pair sit in adjacent rows of the second data.frame, df2, and that the first data.frame is named df1.

myTestList <- lapply(seq(1, nrow(df2), 2),  function(i) 
                     t.test(df1[[df2$Sample_ID[i]]], df1[[df2$Sample_ID[i+1]]], paired=TRUE))

which returns

myTestList
[[1]]

    Paired t-test

data:  df1[[df2$Sample_ID[i]]] and df1[[df2$Sample_ID[i + 1]]]
t = -0.50507, df = 5, p-value = 0.635
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03453201  0.02319070
sample estimates:
mean of the differences 
           -0.005670653 


[[2]]

    Paired t-test

data:  df1[[df2$Sample_ID[i]]] and df1[[df2$Sample_ID[i + 1]]]
t = -2.5322, df = 5, p-value = 0.05239
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0459320947  0.0003458114
sample estimates:
mean of the differences 
            -0.02279314

data

df1 <- read.table(header=TRUE, text="LH3003  LH3004  LH3005   LH3006  LH3007  LH3008  LH3009  LH3010   LH3011
cg18478105 0.02329879 0.08103364 0.01611778 0.01691191 0.01886975 0.01885553 0.01647439 0.02120779 0.01168622
cg14361672 0.09479536 0.07821380 0.02522833 0.06467310 0.05387729 0.05866673 0.08121820 0.10920162 0.04413263
cg01763666 0.03625680 0.04633759 0.04401555 0.08371531 0.09866403 0.17611284 0.07306743 0.12422579 0.11125146
cg02115394 0.10014794 0.09274320 0.08743445 0.08906313 0.09934032 0.18164115 0.06526380 0.08158144 0.08862067
cg13417420 0.01811630 0.02221060 0.01314041 0.01964530 0.02367295 0.01209913 0.01612864 0.01306061 0.04421938
cg26724186 0.32776266 0.31386294 0.24167480 0.29036142 0.24751268 0.26894756 0.20927278 0.28070790 0.33188921")[1:4]

df2 <- read.table(header=TRUE, text="   Sample_ID     Name Group01
3     LH3003     pair1       0
4     LH3004     pair1       1
5     LH3005   pair2       0
6     LH3006   pair2       1")
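
Since the list above holds one full `htest` object per pair, a common next step is to pull the statistics into a table. The following is a minimal sketch, not part of the original answer: it rebuilds the same two pairs with `data.frame()`, keeps `Sample_ID` as character (a factor would index `df1` by integer code), and the names `ids` and `res` are chosen here for illustration.

```r
# Rebuild the example data from this answer (first two pairs only).
df1 <- data.frame(
  LH3003 = c(0.02329879, 0.09479536, 0.03625680, 0.10014794, 0.01811630, 0.32776266),
  LH3004 = c(0.08103364, 0.07821380, 0.04633759, 0.09274320, 0.02221060, 0.31386294),
  LH3005 = c(0.01611778, 0.02522833, 0.04401555, 0.08743445, 0.01314041, 0.24167480),
  LH3006 = c(0.01691191, 0.06467310, 0.08371531, 0.08906313, 0.01964530, 0.29036142)
)
df2 <- data.frame(
  Sample_ID = c("LH3003", "LH3004", "LH3005", "LH3006"),
  Name      = c("pair1", "pair1", "pair2", "pair2"),
  Group01   = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Character IDs (assumption: pairs are adjacent rows, as in the answer).
ids   <- df2$Sample_ID
first <- seq(1, nrow(df2), 2)
myTestList <- lapply(first, function(i)
  t.test(df1[[ids[i]]], df1[[ids[i + 1]]], paired = TRUE))
names(myTestList) <- df2$Name[first]

# One row per pair: t statistic, degrees of freedom, p-value.
res <- data.frame(
  t  = sapply(myTestList, function(tt) unname(tt$statistic)),
  df = sapply(myTestList, function(tt) unname(tt$parameter)),
  p  = sapply(myTestList, function(tt) tt$p.value)
)
res
```

Naming the list elements after `df2$Name` means the row names of `res` identify the pair each test belongs to.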

edited Nov 15 '16 at 13:49

answered Nov 15 '16 at 13:41


should it be `ncol(df2)`? – Dirk Nachbar Nov 15 '16 at 13:57
No. df2 is the data.frame containing the column Sample_ID, which contains the names of the columns of df1. With this structure, the number of rows of df2 is equivalent to the number of columns in df1, so either may be used. – lmo Nov 15 '16 at 14:03
Thank you! Is there a way to combine all the pairs in one paired t-test, i.e. all the rows and all the pairs in a single paired test? If it were not paired, a mixed model could be fitted with Name as a random effect, but what about a paired test? Or would you use a method to combine p-values (I read about Fisher's method for this)? – LHey Nov 15 '16 at 20:30
I believe the question in the comment is about a method to combine all of the tests at once in some model framework. If this is the case, it is outside of the scope of Stack Overflow. [Cross Validated](http://stats.stackexchange.com/) is a better place to ask statistical methodology questions. – lmo Nov 15 '16 at 20:53
And what if you would like to do the paired t-test for every row? Your answer compares all the rows of one patient as a paired t-test; how would you proceed to compare, for one row, all the paired patient samples, given the design of these dataframes? Since this results in 850000 t-tests, I would like to extract the p-values and t-statistics from all the tests into one dataframe with the same row names as my first dataframe and with p-value and t-statistic as column names. – LHey Nov 16 '16 at 08:22
I got an error using your code: Error in .subset2(x, i, exact = exact) : subscript out of bounds – LHey Nov 16 '16 at 08:29
It works for the data that I pulled from your example and will extend to larger instances with the same structure. My suspicion is that your real data sets are not structured as in your example. For instance, a column (or pair of columns) that is listed in df2 may be missing from df1. – lmo Nov 16 '16 at 12:20
I know what gave the error: although some pairs were removed from df2, they still remained in the 'memory' of the dataframe as factor levels. If I entered summary(df$Name), these pairs were shown with a count of 0. I removed the unused levels by changing df$Name back to character and then back to factor. – LHey Nov 17 '16 at 09:26
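
For the per-row question raised in the comments above (one paired test per probe across all pairs, with the t-statistic and p-value collected in a dataframe), a vectorised sketch may be more practical than 850000 calls to `t.test`. It is a hedged illustration rather than the answerer's method: the data are simulated, the sample and probe names are hypothetical, the difference is taken as Group01 = 1 minus Group01 = 0, and it assumes every pair has exactly one sample in each group and no missing values.

```r
set.seed(1)
# Simulated stand-ins (hypothetical names): 5 probes x 6 samples, 3 pairs.
variables <- data.frame(
  Sample_ID = paste0("S", 1:6),
  Name      = rep(c("pair1", "pair2", "pair3"), each = 2),
  Group01   = rep(c(0, 1), times = 3),
  stringsAsFactors = FALSE
)
data <- matrix(runif(30), nrow = 5,
               dimnames = list(paste0("probe", 1:5), variables$Sample_ID))

# Sort by pair name so column k of g0 is the partner of column k of g1.
v  <- variables[order(variables$Name, variables$Group01), ]
g0 <- data[, as.character(v$Sample_ID[v$Group01 == 0]), drop = FALSE]
g1 <- data[, as.character(v$Sample_ID[v$Group01 == 1]), drop = FALSE]

# A paired t-test is a one-sample t-test on the within-pair differences.
d <- g1 - g0                                 # probes x pairs
n <- ncol(d)                                 # number of pairs
m <- rowMeans(d)                             # mean difference per probe
s <- sqrt(rowSums((d - m)^2) / (n - 1))      # sd of differences per probe
tstat <- m / (s / sqrt(n))

result <- data.frame(
  t.statistic = tstat,
  p.value     = 2 * pt(-abs(tstat), df = n - 1),  # two-sided
  row.names   = rownames(data)
)
result
```

If `data` is a data.frame rather than a matrix, wrapping it in `as.matrix()` first keeps the arithmetic the same. Any single row can be checked against `t.test(g1[i, ], g0[i, ], paired = TRUE)`.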

score 0 · Answer 2 · answered Nov 15 '16 at 13:55

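You can stack your data, define a pair column, and then run t.test. This is for 1 of the 6 tests:

data2 <- data.frame(x = c(data$LH3003, data$LH3004), pair = c(rep(0, nrow(data)), rep(1, nrow(data))))
# paired = TRUE is required: t.test(x ~ pair, data2) on its own runs an unpaired Welch test
with(data2, t.test(x[pair == 0], x[pair == 1], paired = TRUE))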
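
To see why the `paired` argument matters for this comparison, the sketch below runs the LH3003 and LH3004 columns from the question's `head(data)` output three ways. The variable names are chosen here for illustration; the point is only that the paired test equals a one-sample test on the differences, while the default two-sample call is a different (Welch) test.

```r
# The six values shown for LH3003 and LH3004 in the question.
x <- c(0.02329879, 0.09479536, 0.03625680, 0.10014794, 0.01811630, 0.32776266)
y <- c(0.08103364, 0.07821380, 0.04633759, 0.09274320, 0.02221060, 0.31386294)

paired_test <- t.test(x, y, paired = TRUE)  # paired t-test
one_sample  <- t.test(x - y)                # same test, written on the differences
welch_test  <- t.test(x, y)                 # default: unpaired Welch test

c(paired     = paired_test$p.value,
  one_sample = one_sample$p.value,
  welch      = welch_test$p.value)
```

The first two p-values agree; the third generally does not, because the Welch test ignores which values belong together.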

Dirk Nachbar


score 0 · Answer 3 · answered Nov 15 '16 at 15:16

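Here's a variation on @lmo's:

lapply(unique(df2$Name), function(x){
  # as.character() so that a factor Sample_ID selects columns by name, not by integer code
  samples <- as.character(df2$Sample_ID[df2$Name == x])
  t.test(df1[, samples[1]], df1[, samples[2]], paired = TRUE)
})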
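
The "subscript out of bounds" and "undefined columns selected" errors reported in the comments on this page are consistent with a known factor pitfall: indexing with a factor uses its integer codes, not its labels, and subsetting a factor keeps unused levels. A small sketch with a toy data frame (all names hypothetical):

```r
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
ids <- factor(c("c", "a"))         # levels are sorted: "a" = 1, "c" = 2

by_code <- toy[[ids[1]]]                # uses integer code 2, i.e. column "b"
by_name <- toy[[as.character(ids[1])]]  # uses the label "c"

# Removing rows does not remove factor levels; droplevels() does.
pairs_f <- factor(c("pair1", "pair2", "pair3"))
kept    <- pairs_f[pairs_f != "pair2"]
levels(kept)               # still lists "pair2"
levels(droplevels(kept))   # only the levels that are actually present
```

Converting the ID column with `as.character()` before indexing, and calling `droplevels()` after filtering, avoids both problems.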

emilliman5


When I run the code like this, I get this error: 'Error in `[.data.frame`(betamatrix.EPIC3.5hmC.excl, , samples[1]) : undefined columns selected'. When I run the code with a fixed Name, as in 'samples <- df2[df2$Name=="pair1",1]', I get 6 results for the 6 pairs. Am I doing something wrong? – LHey Nov 16 '16 at 08:50
It works on a similar other dataset, I will check the data again. Thank you for this code! – LHey Nov 17 '16 at 06:56
