I have used the rcorr function from the Hmisc library to calculate correlations and p-values, then extracted the p-values into a Pvalue matrix and the correlation coefficients into an Rvalue matrix.
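For context, a minimal sketch of that rcorr step on made-up data (the variable names here are illustrative): rcorr() returns a list whose $r element holds the correlation coefficients and $P the p-values, with NA on the diagonal of $P.

```r
# Sketch of the rcorr step with invented data.
library(Hmisc)

set.seed(1)
dat <- matrix(rnorm(100), nrow = 20)   # 20 observations of 5 variables
res <- rcorr(dat)                      # Pearson correlations by default

corr <- res$r   # correlation coefficient matrix
Pval <- res$P   # p-value matrix (diagonal is NA)
```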

Rvalue<-structure(c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 
1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 
1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1), .Dim = c(10L, 
10L), .Dimnames = list(c("41699", "41700", "41701", "41702", 
"41703", "41704", "41705", "41707", "41708", "41709"), c("41699", 
"41700", "41701", "41702", "41703", "41704", "41705", "41707", 
"41708", "41709")))

Pvalue<-structure(c(NA, 0, 0, 0, 0.0258814351024321, 0, 0, 0, 0, 0, 0, 
NA, 6.70574706873595e-14, 0, 0, 2.1673942640632e-09, 1.08217552696743e-07, 
0.0105345133269157, 0, 0, 0, 6.70574706873595e-14, NA, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, NA, 2.22044604925031e-15, 0, 0, 0, 0, 
0, 0.0258814351024321, 0, 0, 2.22044604925031e-15, NA, 0, 0, 
0, 0.000322310440723728, 0.00298460759118657, 0, 2.1673942640632e-09, 
0, 0, 0, NA, 0, 0, 0, 0, 0, 1.08217552696743e-07, 0, 0, 0, 0, 
NA, 0, 0, 0, 0, 0.0105345133269157, 0, 0, 0, 0, 0, NA, 0, 0, 
0, 0, 0, 0, 0.000322310440723728, 0, 0, 0, NA, 0, 0, 0, 0, 0, 
0.00298460759118657, 0, 0, 0, 0, NA), .Dim = c(10L, 10L), .Dimnames = list(
c("41699", "41700", "41701", "41702", "41703", "41704", "41705", 
"41707", "41708", "41709"), c("41699", "41700", "41701", 
"41702", "41703", "41704", "41705", "41707", "41708", "41709"
)))

Then I converted the correlation matrix to a Boolean (0/1) matrix, where 1 means a good correlation. Now I want to match good correlations with significant p-values: I need an edge list that includes the p-value. I implemented the following code:

n <- 1
m <- list()
for (i in 1:nrow(Rvalue)) {
  for (j in 1:nrow(Rvalue)) {
    if (i < j & Pvalue[i, j] < 0.05 & Rvalue[i, j] == 1) {
      m[[n]] <- c(rownames(Rvalue)[i], colnames(Rvalue)[j],
                  signif(Pvalue[i, j], digits = 4))
      n <- n + 1
    }
  }
  print(i)
}

Then the output is:

> m
[[1]]
[1] "41699" "41700" "0"

[[2]]
[1] "41699" "41701" "0"

[[3]]
[1] "41699" "41702" "0"

[[4]]
[1] "41699" "41704" "0"
...

The result is OK, but since the matrices are very big, it takes a lot of time. How can I speed up this process? Please note that I need the node names. Are there any related functions? I also found two similar questions, but they are not exactly what I need (+ and +). Thanks in advance.

user3789396

2 Answers


You could try

indx <- which(Rvalue==1 & Pvalue < 0.05 & !is.na(Pvalue), arr.ind=TRUE)
d1 <- data.frame(rN=row.names(Rvalue)[indx[,1]], 
               cN=colnames(Rvalue)[indx[,2]], Pval=signif(Pvalue[indx],
                                                                digits=4))

head(d1,2)
#     rN    cN Pval
#1 41700 41699    0
#2 41701 41699    0
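A possible refinement, sketched here on small invented matrices rather than the OP's data: restricting the lookup to the upper triangle mirrors the `i < j` condition in the original loop, so each pair appears only once in the edge list.

```r
# Invented 3x3 matrices standing in for Rvalue/Pvalue.
R <- matrix(c(1, 1, 0,
              1, 1, 1,
              0, 1, 1), 3, 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
P <- matrix(c(NA,   0.01, 0.20,
              0.01, NA,   0.03,
              0.20, 0.03, NA), 3, 3, dimnames = dimnames(R))

# upper.tri() keeps only i < j, so each undirected edge is listed once.
keep <- R == 1 & P < 0.05 & !is.na(P) & upper.tri(R)
indx <- which(keep, arr.ind = TRUE)
edges <- data.frame(from = rownames(R)[indx[, 1]],
                    to   = colnames(R)[indx[, 2]],
                    Pval = signif(P[indx], digits = 4))
edges  # rows: a-b (0.01) and b-c (0.03)
```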

Update

Not sure why you are getting the same result when you change the cutoff. It may be that the p-values are so small that the condition is TRUE for every cutoff you tried. Here is an example to show that it does return different values. Suppose I create a function from the above code:

f1 <- function(Rmat, Pmat, cutoff){
  indx <- which(Rmat==1 & Pmat < cutoff & !is.na(Pmat), arr.ind=TRUE)
  d1 <- data.frame(rN=row.names(Rmat)[indx[,1]],
                   cN=colnames(Rmat)[indx[,2]],
                   Pval=signif(Pmat[indx], digits=4))
  d1
}

 f1(R1, P1, 0.05)
 #  rN cN  Pval
 #1  B  A 0.021
 #2  C  A 0.018
 #3  D  A 0.001
 #4  A  B 0.021
 #5  A  C 0.018
 #6  E  C 0.034
 #7  A  D 0.001
 #8  C  E 0.034

 f1(R1, P1, 0.01)
 #  rN cN  Pval
 #1  D  A 0.001
 #2  A  D 0.001

 f1(R1, P1, 0.001)
 #[1] rN   cN   Pval
 #<0 rows> (or 0-length row.names)

data

set.seed(24)
R1 <- matrix(sample(c(0,1), 5*5, replace=TRUE), 5,5, 
            dimnames=list(LETTERS[1:5], LETTERS[1:5]))
R1[lower.tri(R1)] <- 0
R1 <- R1+t(R1)
diag(R1) <- 1


set.seed(49)
P1 <- matrix(sample(seq(0,0.07, by=0.001), 5*5, replace=TRUE), 5, 5,
       dimnames=list(LETTERS[1:5], LETTERS[1:5]))

P1[lower.tri(P1)] <- 0
P1 <- P1+t(P1)
diag(P1) <- NA
akrun
  • Thanks, it was great. It took just 10 seconds for a matrix of 18000*18000. Unexpected. Your hack motivated me to ask a question: how can I upgrade my coding and get rid of for loops? Any references or experience? – Sadegh Dec 29 '14 at 20:15
  • I came up with a problem with your code. When I change the p-value cutoff, the result doesn't change? – user3789396 Dec 29 '14 at 20:50
  • @user3079143 Sorry, I didn't understand your question. Isn't this removing/getting rid of `for` loops? – akrun Dec 30 '14 at 03:44
  • @user3079143 If you check the code, especially the `indx`, it is giving you the `row/col` where the conditions are met. It is possible that the conditions are met for a range of `pvalues`. – akrun Dec 30 '14 at 03:46
  • Dear akrun, your code is a beautiful way to remove for loops. I have asked a general question: could you recommend how to upgrade my coding above to something like you have written? Should I learn more functions? Should I refer to special references? – user3789396 Dec 30 '14 at 06:34
  • @user3789396 Sorry, I misunderstood your question. I would recommend you to practice more, because in any programming language practice is the key. Try to read/solve questions on `stackoverflow` and the `R mailing list`, which will `upgrade` you more than just reading some books. – akrun Dec 30 '14 at 06:36
  • Dear akrun, thank you for helping with the code and also for your applied recommendations. I'm keeping my fingers crossed for you. – user3789396 Dec 30 '14 at 06:42

Since your matrices have a large number of rows and columns, it is a good idea to avoid nested "for loops". You can instead use the mapply function, which is handier:

mapply(FUN = myf , yourMatrix)

In place of FUN, use a function that compares each element against a threshold:

myf = function(x){ x < threshold }

You can use mapply this way twice to check whether the elements of the correlation and p-value matrices pass their thresholds. Store the results in two Boolean matrices, P1 and P2. Then multiply P1 and P2 (element-wise).

myf1 = function(x) {x < 0.05}
myf2 = function(x) {x > 0.7}

P1 = mapply(FUN = myf1 , matP)

P2 = mapply(FUN = myf2 , matR)

P = P1 * P2

The elements of P equal to 1 mark the desired edges. It will work fine!

And here is the result for your sample:

P1 = mapply(FUN = myf1 , Pvalue)
P2 = mapply(FUN = myf2 , Rvalue)
P = P1 * P2

NA  1  1  1  0  1  1  0  1  1
 1 NA  0  0  0  0  0  0  1  1
 1  0 NA  1  0  1  1  1  1  1
 1  0  1 NA  0  1  1  0  1  1
 0  0  0  0 NA  1  0  1  0  0
 1  0  1  1  1 NA  1  1  1  1
 1  0  1  1  0  1 NA  1  1  1
 0  0  1  0  1  1  1 NA  0  0
 1  1  1  1  0  1  1  0 NA  1
 1  1  1  1  0  1  1  0  1 NA
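One detail worth noting, sketched on small invented matrices rather than the real data: `mapply()` flattens a matrix to a plain vector, so the dimensions and dimnames are lost. To read off the node names the question asks for, the shape has to be restored first.

```r
myf1 <- function(x) { x < 0.05 }
myf2 <- function(x) { x == 1 }   # == 1 to match the OP's 0/1 matrix

# Invented 3x3 stand-ins for the p-value and 0/1 correlation matrices.
matP <- matrix(c(NA,   0.01, 0.20,
                 0.01, NA,   0.03,
                 0.20, 0.03, NA), 3, 3,
               dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
matR <- matrix(c(1, 1, 0,
                 1, 1, 1,
                 0, 1, 1), 3, 3, dimnames = dimnames(matP))

P <- mapply(FUN = myf1, matP) * mapply(FUN = myf2, matR)  # flat vector
dim(P) <- dim(matR)                                       # restore matrix shape
dimnames(P) <- dimnames(matR)                             # restore node names

sel <- which(P == 1, arr.ind = TRUE)   # NA diagonal is skipped by which()
edges <- data.frame(from = rownames(P)[sel[, 1]],
                    to   = colnames(P)[sel[, 2]])
```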

PNS
  • I am not sure how this `works`. Please do show using the OP's datasets. – akrun Dec 30 '14 at 06:03
  • You can consider the following code: `myf1 = function(x){ x < 0.05 }; myf2 = function(x){ x > 0.7 }; mat1_p <- matrix(sample(c(0,1), 100*100, replace=TRUE), 100, 100); mat2_r <- matrix(sample(c(0,1), 100*100, replace=TRUE), 100, 100); P1 = mapply(FUN = myf1, mat1_p); P2 = mapply(FUN = myf2, mat2_r); P = P1 * P2` – PNS Dec 30 '14 at 06:22
  • But this is not what you showed in the post, especially the `P1` and `P2` – akrun Dec 30 '14 at 06:25
  • It just took 3 sec for a matrix of size 100000 * 100000 on a PC with a Celeron 2.1 GHz and 1 GB RAM. It is supposed to run on clusters in milliseconds. – PNS Dec 30 '14 at 06:26
  • Yes, that was a mistake. – PNS Dec 30 '14 at 06:26
  • @PaymanNickchi I am not sure it gives the same as the expected result the OP seeks. For example, using my dataset `P1, R1`, i.e. `myf1 <- function(x) {x < 0.05}; myf2 <- function(x) {x==1}; mapply(myf1, P1)*mapply(myf2, R1)`. You can compare it with the result I got. – akrun Dec 30 '14 at 06:30
  • @PaymanNickchi I would like to add that you don't even need `mapply` here. `c(myf1(P1)*myf2(R1))` would give the same result as your code. – akrun Dec 30 '14 at 06:53
  • mapply seems to work faster than using c(myf1(P1)*myf2(P2)). The difference in time for two small matrices is small, but it is definitely large for big matrices. I have pasted the results. – PNS Dec 30 '14 at 07:40
  • I didn't do any benchmarks. All the apply family functions are some kind of loops, so it is surprising. Having said that, the output you got is not the expected output the OP wanted. – akrun Dec 30 '14 at 07:43
  • You are right that `mapply` is faster. Here, `mapply` is applying the function to each individual element of the matrix, compared to `myf1(P1)` applying to the matrix as a whole. So, perhaps, applying individually is faster than the whole-matrix approach. – akrun Dec 30 '14 at 07:55
  • Yes, mapply is faster than a "for loop". You can see the page http://stackoverflow.com/questions/5533246/why-is-apply-method-slower-than-a-for-loop-in-r – PNS Dec 30 '14 at 08:08
  • But I ran both codes, your code using the which function and my code using mapply. The cluster I ran on was a little busy, but your code runs in about 44 sec and my code in about 5 min. It seems the ranking is which, apply, and finally for. – PNS Dec 30 '14 at 08:09
  • A properly done `for` loop should be similar to `mapply`. What I meant is that if `memory` allocation is properly done in a `for` loop, there wouldn't be much difference. – akrun Dec 30 '14 at 08:09
  • apply functions are a little faster than for loops and keep memory from filling up. They dynamically allocate memory to the program, hence the program will not stop. tapply is a special case in the family; it uses C library functions and is significantly faster than the other functions. – PNS Dec 30 '14 at 08:11
  • Anyway, your code was great. Thank you for sharing. – PNS Dec 30 '14 at 08:11
  • You have to initialize a list or something: `lst <- vector('list', 1e4)` etc. – akrun Dec 30 '14 at 08:11
  • Memory allocation is much better handled using apply functions. – PNS Dec 30 '14 at 08:12
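A small base-R sketch of the preallocation point discussed in the comments above (toy data invented here; absolute timings vary by machine): growing a list inside a loop and preallocating it with `vector("list", n)` give the same result, but the preallocated version avoids repeated reallocation as the list grows.

```r
# Toy comparison: growing a list element-by-element vs. preallocating it.
set.seed(1)
n <- 10000
x <- runif(n)

# Growing: m starts empty and is extended on every iteration.
m <- list()
for (i in seq_len(n)) m[[i]] <- x[i] < 0.5

# Preallocated: the list is created at full length up front.
m2 <- vector("list", n)
for (i in seq_len(n)) m2[[i]] <- x[i] < 0.5

identical(m, m2)  # TRUE: same contents either way
```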