speedy t.test on each row of an R data.table

Question

Can I use data.table's inherent speed to get a faster row-by-row t.test result, with variable column names? Below is my current code, which takes a few seconds per 1000 rows.

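library(data.table)

# Run a Welch t-test on every row of dt, comparing the columns named in
# samples1 against those named in samples2, and store the results in dt.
slow.diffexp <- function(dt, samples1, samples2) {
  for (i in seq_len(nrow(dt))) {
    # Print progress every 1000 rows
    if (i %% 1000 == 0) {
      cat(i, "\n")
    }
    a <- t.test(dt[i, samples1, with=FALSE],
                dt[i, samples2, with=FALSE])
    # set() assigns to row i by reference
    set(dt, i, "tt.p.value", a$p.value)
    set(dt, i, "tt.mean1", a$estimate[1])
    set(dt, i, "tt.mean2", a$estimate[2])
  }
}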

test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);
slow.diffexp(test.dt, samples1, samples2);

I have looked at the following related posts:

Paired t-test for each row of a data table: has a solution but can we get faster?
Doing t.test for columns for each row in data set: does not use data.table; also slow

I'm using set() because I have this idea that set is faster than <- for data.frames...

`data.table` is typically speedier on column-wise operations. — lmo, Jun 29 '16 at 16:21
You might get more mileage out of creating your own pared down version of `t.test.default` that does only the specific things you need. Alternatively, you could simply draw a random sample of p values, which would be almost instantaneous. — joran, Jun 29 '16 at 16:27

score 0 · Answer 1 · answered Jun 29 '16 at 20:16

This doesn't explicitly use data.table, but it should be much faster than the for loop:

library(data.table)
set.seed(700)
test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);

system.time(myList<-apply(test.dt, 1, function(x) t.test(x[samples1], x[samples2])))
# user  system elapsed 
# 18.44    0.00   18.47 

# Pull the p-value and the two group means out of each htest result by name
test.dt$tt.p.value<-sapply(myList, function(x) x$p.value)
test.dt$tt.mean1<-sapply(myList, function(x) x$estimate[[1]])
test.dt$tt.mean2<-sapply(myList, function(x) x$estimate[[2]])

test.dt[1:10, 19:23, with = FALSE]

V19 V20 tt.p.value tt.mean1 tt.mean2
962 536    0.98203    460.8    463.9
882 767    0.06294    657.4    416.0
371 111    0.73440    463.1    502.8
173 720    0.57195    595.9    513.3
126 404    0.86948    602.8    619.5
 14  16    0.63462    315.7    377.3
870 384    0.03670    377.7    626.6
142 997    0.19836    623.2    442.8
  4 193    0.99891    628.4    628.2
250 888    0.35232    590.9    498.5

The loop from the question is roughly 12 times slower per row. It needs slightly more time (22.17 s versus 18.47 s elapsed) to process one tenth of the rows:

system.time(slow.diffexp(test.dt[1:10000], samples1, samples2))
# user  system elapsed 
# 22.12    0.00   22.17
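
If only the p-value and the two group means are needed, joran's suggestion of a pared-down t.test can be taken a step further by computing the Welch statistic for all rows at once with rowMeans and rowSums. The following is a minimal sketch and has not been timed here. row_welch_t is a hypothetical helper name, and it assumes numeric matrices with no missing values and covers only the default two-sided, unequal-variance test.

```r
# Hypothetical helper: Welch two-sample t-test for every row at once.
# x and y are numeric matrices with one row per test and no NA values.
row_welch_t <- function(x, y) {
  n1 <- ncol(x); n2 <- ncol(y)
  m1 <- rowMeans(x); m2 <- rowMeans(y)
  # Row-wise sample variances. m1 has one entry per row and is recycled
  # down the columns, so each element has its own row mean subtracted.
  v1 <- rowSums((x - m1)^2) / (n1 - 1)
  v2 <- rowSums((y - m2)^2) / (n2 - 1)
  se2 <- v1 / n1 + v2 / n2                      # squared standard error
  tstat <- (m1 - m2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  data.frame(p.value = 2 * pt(-abs(tstat), df), mean1 = m1, mean2 = m2)
}

# Small check against t.test() on made-up data
set.seed(1)
x <- matrix(sample(1000, 50, replace = TRUE), nrow = 5)
y <- matrix(sample(1000, 50, replace = TRUE), nrow = 5)
res <- row_welch_t(x, y)
all.equal(res$p.value[1], t.test(x[1, ], y[1, ])$p.value)
```

With the data above, the two inputs would be as.matrix(test.dt[, samples1, with=FALSE]) and as.matrix(test.dt[, samples2, with=FALSE]), and the three result columns could then be assigned back to test.dt.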
