Can I use data.table's inherent speed to get a faster row-by-row t.test result, with variable column names? Below is my current code; it takes a few seconds per 1000 rows.

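library(data.table)

# Run a two-sample t.test on every row, comparing columns samples1 vs samples2,
# and store the p-value and both group means in dt by reference.
slow.diffexp <- function(dt, samples1, samples2) {
  for (i in seq_len(nrow(dt))) {
    # Progress report every 1000 rows
    if (i %% 1000 == 0) {
      cat(i, "\n")
    }
    a <- t.test(dt[i, samples1, with=FALSE],
                dt[i, samples2, with=FALSE])
    set(dt, i, "tt.p.value", a$p.value)
    set(dt, i, "tt.mean1", a$estimate[1])
    set(dt, i, "tt.mean2", a$estimate[2])
  }
}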

test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);
slow.diffexp(test.dt, samples1, samples2);

I have looked at the following related posts:

I'm using set() because I have this idea that set is faster than <- for data.frames...

  • `data.table` is typically speedier on column-wise operations. – lmo Jun 29 '16 at 16:21
  • You might get more mileage out of creating your own pared down version of `t.test.default` that does only the specific things you need. Alternatively, you could simply draw a random sample of p values, which would be almost instantaneous. – joran Jun 29 '16 at 16:27
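
Both comments point the same way: do the arithmetic column-wise and compute only what is needed. As a rough sketch of that idea, not the commenters' or the answerer's code, the Welch statistic that `t.test` uses by default can be computed for all rows at once with `rowMeans` and `rowSums`. The helper name `row_welch_t` is hypothetical, and the sketch assumes complete numeric data with no missing values and non-zero variance in every row.

```r
# Hypothetical helper: one Welch two-sample t-test per row, fully vectorised.
# x and y hold the two sample groups as columns (matrix, data.frame or data.table).
row_welch_t <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n1 <- ncol(x)
  n2 <- ncol(y)
  m1 <- rowMeans(x)
  m2 <- rowMeans(y)
  # Unbiased per-row variances; m1 and m2 recycle down the columns,
  # so each row has its own mean subtracted
  v1 <- rowSums((x - m1)^2) / (n1 - 1)
  v2 <- rowSums((y - m2)^2) / (n2 - 1)
  se2 <- v1 / n1 + v2 / n2
  tstat <- (m1 - m2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  list(p.value = 2 * pt(-abs(tstat), df), mean1 = m1, mean2 = m2)
}

# Toy data: 10 rows, 20 columns, first 10 columns vs last 10
set.seed(1)
m <- matrix(sample(1000, 200, replace = TRUE), nrow = 10)
res <- row_welch_t(m[, 1:10], m[, 11:20])

# Sanity check of the first row against t.test itself
ref <- t.test(m[1, 1:10], m[1, 11:20])
stopifnot(isTRUE(all.equal(res$p.value[1], ref$p.value)))

# With the question's data.table, the result could be attached like this
# (assumes data.table is loaded and test.dt, samples1, samples2 exist):
# res <- row_welch_t(test.dt[, samples1, with = FALSE],
#                    test.dt[, samples2, with = FALSE])
# test.dt[, c("tt.p.value", "tt.mean1", "tt.mean2") := res]
```

Because every step is a whole-matrix operation, the cost no longer includes one `t.test` call per row. The trade-off is that the confidence interval and the other fields of the `htest` object are not produced.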

1 Answer

This doesn't explicitly use data.table, but it should be much faster than the for loop:

set.seed(700)
test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);

system.time(myList<-apply(test.dt, 1, function(x) t.test(x[samples1], x[samples2])))
# user  system elapsed 
# 18.44    0.00   18.47 

# Pull the p-value and the two group means out of each htest object by name
test.dt$tt.p.value <- sapply(myList, function(x) x$p.value)
test.dt$tt.mean1 <- sapply(myList, function(x) x$estimate[[1]])
test.dt$tt.mean2 <- sapply(myList, function(x) x$estimate[[2]])

test.dt[1:10, 19:23, with = FALSE]

V19 V20 tt.p.value tt.mean1 tt.mean2
962 536    0.98203    460.8    463.9
882 767    0.06294    657.4    416.0
371 111    0.73440    463.1    502.8
173 720    0.57195    595.9    513.3
126 404    0.86948    602.8    619.5
 14  16    0.63462    315.7    377.3
870 384    0.03670    377.7    626.6
142 997    0.19836    623.2    442.8
  4 193    0.99891    628.4    628.2
250 888    0.35232    590.9    498.5
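
The extraction step above walks the list of test results three times. As an optional sketch, the same three numbers can be collected in a single pass with `vapply`, which also checks that each result has the expected length. The toy `myList` here stands in for the one built above, and the column names simply mirror those used in this answer.

```r
# Toy stand-in for myList: one htest object per row of a small matrix
set.seed(2)
m <- matrix(rnorm(80), nrow = 4)
myList <- lapply(seq_len(nrow(m)), function(i) t.test(m[i, 1:10], m[i, 11:20]))

# One pass over the list; vapply enforces a numeric result of length 3 per test
stats <- t(vapply(myList, function(tt) {
  c(tt.p.value = tt$p.value,
    tt.mean1   = unname(tt$estimate[1]),
    tt.mean2   = unname(tt$estimate[2]))
}, numeric(3)))

# stats is a matrix with one row per test and three named columns
colnames(stats)
```

This does not change the dominant cost, which is still the per-row `t.test` call; it only tidies the extraction.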

The loop from the question is about 10x slower: it does 1/10th of the work (10,000 rows instead of 100,000) in slightly more time.

system.time(slow.diffexp(test.dt[1:10000], samples1, samples2))
# user  system elapsed 
# 22.12    0.00   22.17 
Bryan Goggin