
So I am using the R package doParallel to parallelize some steps of my scripts when I have to handle large lists of elements, to compute them faster. Until now, all of the functions I used worked perfectly well with foreach(): I just had to specify my number of cores with registerDoParallel() and that was all!

I recently tried to use different statistical tests in R, var.test() and t.test(), and I don't understand why, but I realized that they weren't working inside foreach()... To be clearer, what I am basically doing is iterating over the rows of 2 matrices of the same dimensions: each row, in each matrix, contains 5 numeric values, and I do for example:

var.test(matrixA[1,],matrixB[1,])$p.value

to extract, for row number 1, the corresponding p-value from the F test of equal variances made on 10 numeric values (2 groups of 5 values from row number 1 of each matrix). The problem is that my matrices have millions of rows, so I have to iterate over the number of rows, and I do this with the foreach() function:

p.values.res <- foreach(i = seq_len(nrow(matrixA))) %dopar%
  var.test(matrixA[i, ], matrixB[i, ])$p.value

(Here I set registerDoParallel(cores = 6) prior to calling foreach().) I tried different tests: the F test (var.test()) and Student's t-test (t.test()), and unfortunately none of them ran on my 6 cores, only on one.

I also tried with "cl": registerDoParallel(cl = 4). It doesn't work either.

I tried restarting R, quitting and reopening the session, and restarting the computer: it doesn't work.
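For reference, here is a minimal sanity check of the registration itself (a sketch; the worker count of 4 is an arbitrary choice, and makeCluster() creates a PSOCK cluster rather than using registerDoParallel(cores = ...)). If getDoParWorkers() does not report the expected number, the backend was never registered:

```r
library(doParallel)

# Create an explicit cluster object instead of registerDoParallel(cores = 6);
# 4 workers here is arbitrary, for illustration only.
cl <- makeCluster(4)
registerDoParallel(cl)

# Confirm that foreach actually sees the registered workers
print(getDoParWorkers())

stopCluster(cl)
```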

Does anybody know why it does not work, and how to fix it?

My configuration: Linux Mint 18.2 Cinnamon 64-bit (Cinnamon 3.4.6); Intel Core i7-6700 CPU; R version 3.4.3 (2017-11-30); RStudio version 1.1.383.

Here are 2 short examples of matrices:

MatrixA:

0.7111111  0.7719298  0.7027027   0.6875000  0.6857143
0.8292683  0.6904762  0.8222222   0.8333333  0.6250000
0.8846154  0.5714286  0.8928571   0.8846154  0.9259259
0.9000000  0.5000000  0.9500000   0.8666667  0.8260870
0.8235294  0.3684211  0.9411765   0.8333333  0.8000000
0.5714286  0.2142857  0.6666667   0.5000000  0.5555556

MatrixB:

0.5227273  0.7142857  0.7808219   0.6346154  0.7362637
0.9166667  0.7173913  0.8611111   0.7391304  0.7538462
0.8666667  0.6052632  0.8260870   0.7333333  0.9024390
0.9285714  0.5806452  0.8750000   0.6956522  0.8787879
0.8333333  0.5517241  0.8333333   0.6818182  0.8750000
0.7500000  0.2941176  0.6666667   0.4444444  0.7500000

Thank you all in advance for your help. Regards,

Yoann Pageaud
  • It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data and code we can copy/paste to run. It's unclear exactly what you are trying or exactly what the error/problem is. – MrFlick Dec 04 '17 at 15:28
  • You can generate 2 matrices of random numeric values (with dots, I mean like 1.2, 2.5, ...). The only "special" thing is that both matrices have the same number of rows, and only 4 columns (so 4 values per row). I cannot copy my matrices easily currently, sorry... What I am doing is iterating row by row, comparing the values from the 2 matrices A and B. Example: row 1: var.test(matrixA[1,],matrixB[1,])$p.value, row 2: var.test(matrixA[2,],matrixB[2,])$p.value [...] and I keep only the p-values of each result to add to my p.values.res list. – Yoann Pageaud Dec 04 '17 at 15:33

2 Answers


I can't reproduce your problem. This works fine for me:

matrixA <- matrix(runif(36), 6)
matrixB <- matrix(runif(36), 6)

cl <- parallel::makeCluster(4)
doParallel::registerDoParallel(cl)
library(foreach)
p.values.res <- foreach(i = seq_len(nrow(matrixA))) %dopar%
  var.test(matrixA[i, ], matrixB[i, ])$p.value
parallel::stopCluster(cl)
F. Privé
  • I'll try your solution asap. The problem could be that I do not set the number of clusters the same way as you do in my tests. I'll keep you posted. – Yoann Pageaud Dec 04 '17 at 17:03
  • Sorry, I tried it, and it does not work either: after a few seconds everything goes on 1 CPU. Any idea where this problem could come from? It only happens for the F test and Student's t-test in my case. foreach() works perfectly well with all the other functions I previously used, except these ones... – Yoann Pageaud Dec 04 '17 at 17:23
  • What do you mean by `everything goes on 1 CPU`? – F. Privé Dec 04 '17 at 17:24
  • sorry, everything on one core. – Yoann Pageaud Dec 04 '17 at 17:26
  • it does not parallelize what I am running. – Yoann Pageaud Dec 04 '17 at 17:27
  • How do you assess whether it parallelizes or not? – F. Privé Dec 04 '17 at 17:33
  • In Linux Mint I installed a little widget in the bar displayed at the bottom to see the use of CPU, RAM, and disks live, so I can quickly check which cores are working and which ones are not. I can also check that in the terminal with the command: atop 2. If I see only one core used, then it is not parallelized on 4 cores. – Yoann Pageaud Dec 04 '17 at 17:39
  • With `N <- 1e4; matrixA <- matrix(runif(36*N), 6*N); matrixB <- matrix(runif(36*N), 6*N)`, you might see something. – F. Privé Dec 04 '17 at 19:47
  • I tried with your example: there is a moment at the very beginning, right after I launch the function, where I have 3 cores working on it (I set it to 3 cores in this case), and then a few seconds later I only have one core working... So parallelization is still not working with this example. I tried the var.test() and t.test() functions to compare your matrices. – Yoann Pageaud Dec 05 '17 at 08:59
  • I am starting to think that the problem might come from the dimensions of my matrices, which are enormous (millions of rows). Maybe it complicates the setting for parallelization? Honestly I have no idea... – Yoann Pageaud Dec 05 '17 at 09:02
  • Here are the functions I tried, so that you can try them and see what they give you: `p.vals.m <- foreach(i=seq(dim(matrixA)[1])) %dopar% t.test(matrixA[i,],matrixB[i,])$p.value` and `p.vals.m <- foreach(i=seq(dim(matrixA)[1])) %dopar% var.test(matrixA[i,],matrixB[i,])$p.value` – Yoann Pageaud Dec 05 '17 at 09:06
  • I was trying to tell you that there is nothing wrong with the parallelization. The problem is the way you parallelize. Cores are busy doing other things than doing the computation so that you think they are doing nothing. [This](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/#iterate-over-lots-of-elements.) might help you. – F. Privé Dec 05 '17 at 10:36
  • No, I don't think the cores are busy. Normally, when I parallelize a function on several cores, all of them are working at 100%. Here, when I do the same thing for t.test() or var.test(), I have only one core working at 100%; the others are at 0%. It clearly does not have the same profile as usual when I use doParallel. – Yoann Pageaud Dec 05 '17 at 11:16
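Following the advice in the linked guide, one way to cut the per-task overhead is to hand each worker a whole chunk of rows instead of a single row, so that communication time no longer dominates. A sketch under arbitrary assumptions (3 workers, 6000 rows of toy data):

```r
library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

# Toy data standing in for the real million-row matrices
matrixA <- matrix(runif(5 * 6000), ncol = 5)
matrixB <- matrix(runif(5 * 6000), ncol = 5)

# One large block of row indices per worker, so each %dopar% task is substantial
n <- nrow(matrixA)
chunks <- split(seq_len(n), cut(seq_len(n), 3, labels = FALSE))

p.values.res <- foreach(idx = chunks, .combine = c) %dopar%
  vapply(idx, function(i) var.test(matrixA[i, ], matrixB[i, ])$p.value,
         numeric(1))

stopCluster(cl)
```

Since the chunks partition seq_len(n) in order and .combine = c concatenates them in order, the result lines up with the original row order.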

Unfortunately I didn't find any solution to my problem with doParallel, but I realized that I did not have to use it in the first place.

From the R package "genefilter" I found an alternative solution using the function rowttests(), which is really fast for doing t-tests on a large matrix. The only caveat is that the function assumes equal variances when calculating p-values (and you can't change that). Fortunately that is my case.

So I just had to cbind() my 2 matrices and specify the group membership of the columns as a factor. And that's all!

library(genefilter)
bind_matrix <- cbind(matrixA, matrixB)
fact <- factor(c("A","A","A","A","A","B","B","B","B","B"))
p.vals <- rowttests(bind_matrix, fact)$p.value

It takes a few seconds, and I tried it on a 10-million-row matrix.

For the F test there is an analogous function, rowFtests().
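For completeness, a sketch of rowFtests() on toy data mirroring the cbind() setup above (the matrices here are randomly generated stand-ins):

```r
library(genefilter)  # Bioconductor package

# Toy stand-ins for the real matrices
matrixA <- matrix(runif(30), ncol = 5)
matrixB <- matrix(runif(30), ncol = 5)

bind_matrix <- cbind(matrixA, matrixB)
fact <- factor(rep(c("A", "B"), each = 5))

p.vals.f <- rowFtests(bind_matrix, fact)$p.value
```

One caveat worth verifying in the genefilter documentation: rowFtests() tests equality of group means (an ANOVA-style F test), whereas var.test() tests equality of variances, so the two are not interchangeable null hypotheses.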

So now I might ask for a speed-efficient solution for Wilcoxon tests. If someone knows a function that works similarly to these ones, please comment.
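Until someone points to a dedicated row-wise implementation, a plain vapply() loop over rows works as a baseline (not fast, but self-contained; the matrices here are randomly generated stand-ins). The matrixTests package reportedly provides a row-wise two-sample Wilcoxon function, but I have not tested it, so treat that as an assumption to verify:

```r
# Toy stand-ins for the real matrices
matrixA <- matrix(runif(30), ncol = 5)
matrixB <- matrix(runif(30), ncol = 5)

# Baseline: one two-sample Wilcoxon (Mann-Whitney) test per row
p.vals.w <- vapply(seq_len(nrow(matrixA)),
                   function(i) wilcox.test(matrixA[i, ], matrixB[i, ])$p.value,
                   numeric(1))
```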

Yoann Pageaud