This is a follow-up to the question "R: t-test over all columns".

Suppose I have a huge data set from which I have created numerous subsets based on certain conditions. All subsets have the same number of columns. I want to run t-tests on two subsets at a time (outer loop) and, for each pair of subsets, go through all columns one at a time (inner loop).

Here is what I have come up with based on the previous answer. It stops with an error.

C <- c("c1","c1","c1","c1","c1",
   "c2","c2","c2","c2","c2",
   "c3","c3","c3","c3","c3",
   "c4","c4","c4","c4","c4",
   "c5","c5","c5","c5","c5",
   "c6","c6","c6","c6","c6",
   "c7","c7","c7","c7","c7",
   "c8","c8","c8","c8","c8",
   "c9","c9","c9","c9","c9",
   "c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)

Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)

Data.Subsets = c("Data.c1",
                 "Data.c2",
                 "Data.c3",
                 "Data.c4",
                 "Data.c5") 

library(plyr)

combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {

  combo2 <- combn(ncol(Data.Subsets[x]),2)
  adply(combo2, 2, function(y) {

      test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)

      out <- data.frame("Subset" = rownames(Data.Subsets[x]),
                    , "Row" = colnames(x)[y[1]]
                    , "Column" = colnames(x[y[2]])
                    , "t.value" = round(test$statistic,3)
                    ,  "df"= test$parameter
                    ,  "p.value" = round(test$p.value, 3)
                    )
      return(out)
  } )
} )
ery
  • It's not entirely clear what you want your code to do. Do you mean to perform a t test between c1 in subset 1 and c1 in subset 2, followed by a t test on c2 in subset 1 and c2 in subset 2, and so on? At a quick glance, your Data.Subsets is just a character vector; it doesn't actually contain any of the data frame subsets you've made. Looping over it does nothing useful, because your code needs data frames and you are passing it strings. – Davy Kavanagh Mar 12 '12 at 15:58
  • @DavyKavanagh: Yes, Data.Subsets is just a character vector. I tried to use as.data.frame to convert it to a data frame, but with the same result. What I wanted to do is capture the names of these data subsets and then access the actual subsets in the loop. I guess the pertinent question is: how do I pass a data frame as a parameter in the loop? – ery Mar 12 '12 at 16:24

2 Answers

First of all, you can define your dataset more easily by using gl and by not creating individual variables for the columns.

Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)

library(reshape)  # provides melt()
longData <- melt(Data, id.vars = "C")

Now use pairwise.t.test to compute t tests between all pairs of groups, where each group is one combination of a level of C and one of X/Y/Z.

with(longData, pairwise.t.test(value, interaction(C, variable)))

Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
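To make the adjustment concrete, here is a minimal sketch rather than a definitive recipe. It assumes a smaller simulated data set (3 levels of C instead of 10, purely for readability) and builds the long format with base R's stack() so that it runs without extra packages. pairwise.t.test applies the Holm correction by default; passing p.adjust.method = "none" shows the raw p values for comparison.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Smaller illustrative data set: 3 levels of C, 5 rows each (an assumption)
Data <- data.frame(
  C = gl(3, 5, labels = paste("c", 1:3, sep = "")),
  X = rnorm(n = 15, mean = 10, sd = 5),
  Y = rnorm(n = 15, mean = 15, sd = 6),
  Z = rnorm(n = 15, mean = 20, sd = 5)
)

# Long format via base R: stack() returns columns 'values' and 'ind'
longData <- data.frame(C = rep(Data$C, times = 3),
                       stack(Data[c("X", "Y", "Z")]))

# One group per combination of C and variable (9 groups here)
groups <- interaction(longData$C, longData$ind)

# Holm is the default adjustment; "none" returns unadjusted p values
adjusted   <- pairwise.t.test(longData$values, groups, p.adjust.method = "holm")
unadjusted <- pairwise.t.test(longData$values, groups, p.adjust.method = "none")

# Lower-triangular matrix of p values, one entry per pair of groups
adjusted$p.value
```

Each adjusted p value is at least as large as its unadjusted counterpart, which is the price paid for running many tests at once.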

In general, pairwise t tests are inferior to a regression, so be careful about how you use them.

Richie Cotton

You can use get(Data.Subsets[x]), which will pick out the relevant data frame by name. But I don't think this should be necessary.
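As a small illustration of looking data frames up by name, the sketch below uses a tiny made-up data set (the values are placeholders, not the question's data). get() retrieves a single object from its name, and mget() retrieves several at once as a named list.

```r
# Tiny illustrative data set (placeholder values, an assumption)
Data <- data.frame(C = rep(c("c1", "c2"), each = 5), X = 1:10, Y = 11:20)

Data.c1 <- subset(Data, C == "c1", select = X:Y)
Data.c2 <- subset(Data, C == "c2", select = X:Y)

# Character vector holding the *names* of the data frames
Data.Subsets <- c("Data.c1", "Data.c2")

# get() turns one name into the object it refers to
first <- get(Data.Subsets[1])

# mget() does the same for a whole vector of names, returning a named list
all_subsets <- mget(Data.Subsets, envir = environment())

ncol(first)          # number of columns in Data.c1
names(all_subsets)   # the list is named after the data frames
```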

Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like

conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})

That should (I didn't test it) return a list of data frames, each subsetted on one of the conditions you passed in.
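Building on that list, the nested loop from the question could be sketched as follows. This is one possible approach under stated assumptions, not a tested solution for the full data set: the list is given names, combn() enumerates the pairs of subsets, and an inner lapply() runs t.test on each column. The output column names (Subset1, Subset2, Column, and so on) are hypothetical, and p.adjust() is added at the end because so many tests are being run.

```r
set.seed(42)  # arbitrary seed for reproducibility
Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

conditions <- c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x) subset(Data, C == x, select = X:Z))
names(dfs) <- conditions  # name the list so subsets can be looked up by label

# Outer loop: every pair of subsets; inner loop: every column
pairs <- combn(conditions, 2, simplify = FALSE)
results <- do.call(rbind, lapply(pairs, function(p) {
  a <- dfs[[p[1]]]
  b <- dfs[[p[2]]]
  do.call(rbind, lapply(names(a), function(col) {
    test <- t.test(a[[col]], b[[col]])  # Welch two-sample t test
    data.frame(Subset1 = p[1], Subset2 = p[2], Column = col,
               t.value = round(unname(test$statistic), 3),
               df      = unname(test$parameter),
               p.value = test$p.value)
  }))
}))

# Adjust for multiple testing across all comparisons (Holm method)
results$p.adjusted <- p.adjust(results$p.value, method = "holm")
head(results)
```

With 5 subsets and 3 columns this produces one row per comparison, and the data frames are passed around as list elements rather than looked up by name.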

However, as @Richie Cotton points out, it would be a much better idea to reshape your data frame and use pairwise t tests.

I should point out that running this many t-tests doesn't seem wise, even after correcting for multiple testing, whether by FDR, permutation, or otherwise. It would be better to work out whether you can use an ANOVA of some sort, since ANOVA is designed for almost exactly this purpose.
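As one possible reading of "an ANOVA of some sort", the sketch below fits a one-way ANOVA of a single column (X) against the grouping factor C, then uses TukeyHSD() for the pairwise follow-up comparisons. The data are simulated with 5 groups as an assumption, and whether a one-way design suits the real data is something only the analyst can judge. The same call could be repeated for Y and Z.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated data with 5 groups of 5 observations (an assumption)
Data <- data.frame(
  C = gl(5, 5, labels = paste("c", 1:5, sep = "")),
  X = rnorm(n = 25, mean = 10, sd = 5)
)

# One-way ANOVA: does the mean of X differ across the levels of C?
fit <- aov(X ~ C, data = Data)
summary(fit)

# Pairwise comparisons between levels of C, with family-wise error control
tukey <- TukeyHSD(fit)
tukey$C  # one row per pair of levels: diff, lwr, upr, p adj
```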

Davy Kavanagh
  • Thanks! I tried the first answer above, and it works fine ... sort of. I have 522 columns, each with 1522 rows. With the C1 to C16 combinations, the calculation fails with "memory exhausted". Any idea how much memory I need for this? – ery Mar 12 '12 at 23:39