
I am trying to run the Wilcoxon rank-sum test on multiple data sets that I've combined into one large matrix, A, that is 705x17635 (i.e., I want to run the rank-sum test 17,635 times). The only way I've found to do this without for loops is lapply, which I've run as:

lapply(data.frame(A), function(x)
       wilcox.test(x, b, alternative="greater", exact=FALSE, correct=FALSE))

where b is our negative control data, a 20000x1 vector. This takes very long, however (I gave up after 30 minutes), and I'm wondering if there's a quicker way to run it, especially since I can do the same thing in MATLAB (even with a for loop) in about five minutes. I need to use R for various reasons.

Karolis Koncevičius
  • You can replace that by `lapply(data.frame(A), wilcox.test, b, alternative="greater", exact=FALSE, correct=FALSE)` – in other words, you can omit the detour via `function`. – Konrad Rudolph Apr 10 '14 at 20:18
  • The WRS test is fundamentally more complex than t-tests. You are comparing the pairwise values, and with a 20,000-value vector on one side the pairwise comparisons take lots of CPU cycles. You might want to reconsider your analytic strategy. What do you really want to know about the differences between your fairly large control group and the much smaller (but numerous) test groups? Do you just want to know if their medians are different, or perhaps also whether their 75th, 90th and 95th percentiles are materially different from those of the control? – IRTFM Apr 10 '14 at 20:36
  • (1) see if you can modify `wilcox.test` to get a stripped-down version that omits some of the input-checking (may not help that much); (2) parallelize to use multiple CPUs/cores (e.g. use `plyr::llply` with `.parallel` set to something sensible) – Ben Bolker Apr 10 '14 at 20:39
  • Try `mclapply` instead of `lapply`? It is in package `parallel`. Its forking works only on Unix-like systems (Linux, macOS), not on Windows. – bartektartanus Apr 10 '14 at 21:05
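
The parallelization suggested in the comments above could look roughly like the sketch below. It uses `parallel::mclapply`, which ships with R. The small simulated `A` and `b`, the helper name `one_test`, and the choice of 2 cores are assumptions for illustration, not part of the original question. On Windows `mc.cores` must be 1, so the sketch falls back to sequential execution there.

```r
library(parallel)

set.seed(1)
# Small simulated stand-ins for the real data (assumption: illustrative sizes only)
A <- matrix(rnorm(50 * 20), nrow = 50)
b <- rnorm(500)

# Hypothetical helper: one-sided rank-sum p-value for a single column
one_test <- function(x) {
  wilcox.test(x, b, alternative = "greater",
              exact = FALSE, correct = FALSE)$p.value
}

# mclapply forks worker processes; on Windows only mc.cores = 1 is supported
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
cols <- seq_len(ncol(A))

# Index the columns directly instead of converting A to a data.frame
p_par <- unlist(mclapply(cols, function(j) one_test(A[, j]),
                         mc.cores = n_cores))

# Sequential reference, used only to confirm both routes agree
p_seq <- vapply(cols, function(j) one_test(A[, j]), numeric(1))
```

Keeping only the p-value per column, rather than the full `htest` object, keeps the amount of data sent back from each worker small. The speed-up is bounded by the number of cores, so this helps but does not change the per-test cost.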

1 Answer


Some packages try to address this issue, for example:

A <- matrix(rnorm(705*17635), nrow=705)
b <- rnorm(20000)

library(matrixTests)
res <- col_wilcoxon_twosample(A, b) # running time: 83 seconds

A few lines from the result:

res[1:2,]

  obs.x obs.y obs.tot statistic    pvalue alternative location.null exact corrected
1   705 20000   20705   6985574 0.6795783   two.sided             0 FALSE      TRUE
2   705 20000   20705   7030340 0.8997009   two.sided             0 FALSE      TRUE

Check if result is the same as doing wilcox.test() column by column:

wilcox.test(A[,1], b)

    Wilcoxon rank sum test with continuity correction

data:  A[, 1] and b
W = 6985574, p-value = 0.6796
alternative hypothesis: true location shift is not equal to 0
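
The call above uses the defaults (two-sided, with continuity correction), whereas the question asked for a one-sided test without correction. A minimal sketch of how that might be requested and checked against `wilcox.test()` follows. It assumes matrixTests is installed and that `col_wilcoxon_twosample()` accepts `alternative` and `correct` arguments, as the `alternative` and `corrected` result columns suggest. The small simulated data is for illustration only, with more than 50 observations per group so that neither function switches to the exact test.

```r
# Assumption: matrixTests is installed and col_wilcoxon_twosample()
# accepts `alternative` and `correct`; skip quietly otherwise
if (requireNamespace("matrixTests", quietly = TRUE)) {
  set.seed(1)
  A <- matrix(rnorm(60 * 10), nrow = 60)  # 10 small test groups
  b <- rnorm(200)                         # shared control vector

  # One call for all columns, using the settings from the question
  res <- matrixTests::col_wilcoxon_twosample(A, b,
                                             alternative = "greater",
                                             correct = FALSE)

  # Column-by-column reference using base R
  p_ref <- apply(A, 2, function(x)
    wilcox.test(x, b, alternative = "greater",
                exact = FALSE, correct = FALSE)$p.value)

  # Largest absolute difference between the two sets of p-values
  max_diff <- max(abs(res$pvalue - p_ref))
  print(max_diff)
}
```

Comparing all columns at once, rather than a single one, gives a little more confidence that the two routes agree for the chosen options.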
Karolis Koncevičius
  • matrixTests is indeed faster; it is just a pity that it does not calculate confidence intervals. – Jariani Dec 19 '19 at 11:38
  • @Jariani I have an open issue about this [here](https://github.com/KKPMW/matrixTests/issues/2), but didn't get to the point of trying to implement it. It would slow things down if returned by default + I thought few people care about the confidence interval for the pseudo-median. – Karolis Koncevičius Dec 19 '19 at 15:47