0

I am trying to compute a 300,000x300,000 matrix in R, my codes are working quite well but it's been running for days now, how can i make it more efficient and time saving?

My codes are working well but it has been running for days now, attached are a subset of what I'm working with, the ID extends to 300,000; how can i make the codes run faster in minutes as it has been running for days now.

fam <- structure(list(ID = c(1L, 2L, 3L, 4L, 6L, 5L, 7L), dad = c(0L, 
0L, 1L, 1L, 1L, 3L, 5L), mum = c(0L, 0L, 0L, 2L, 4L, 4L, 6L), 
    GEN = c(1L, 1L, 2L, 2L, 3L, 3L, 4L)), class = "data.frame", row.names = c(NA, 
-7L))

hom<-function(data) { 
    library(Matrix)
    library(foreach)
    n<-max(as.numeric(fam[,"ID"])) 
    t<-min(as.numeric(fam[,"ID"])) 
    A<-Matrix(0,nrow=n,ncol=n, sparse=TRUE)

    while(t <=n) {

s<-max(fam[t,"dad"],fam[t,"mum"]) 
d<-min(fam[t,"dad"],fam[t,"mum"])
if (s>0 & d>0 ) 
{ 
  if (fam[t,"GEN"]==999 & s!=d) 
  { warning("both dad and mum should be the same, different for at least       one individual")
    NULL    
  }

  A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)+0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]
  foreach(j = 1:(t-1), .verbose=TRUE,  .combine='c', .packages=c("Matrix", "foreach")) %do%

    { 
      A[t,j]<- 0.5*(A[j,fam[t,"dad"]]+A[j,fam[t,"mum"]])
      A[j,t]<- A[t,j] 
    } 
} 
if (s>0 & d==0 )
{ 
  if ( fam[t,"GEN"]==999) 
  { warning("both dad and mum should be the same, one parent equal to zero for at least individual")
    NULL }
  A[t,t]<- 2-0.5^(fam[t,"GEN"]-1) 
  foreach(j = 1:(t-1), .verbose=TRUE,  .combine='c', .packages=c("Matrix", "foreach")) %do%  
    { 
      A[t,j]<-0.5*A[j,s]
      A[j,t]<-A[t,j] 
    } 
} 
if (s==0 )
{
  A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)
}

cat(" MatbyGEN: ", t ,"\n") 
t <- t+1


} 

A

}

Output of the above example
%%MatrixMarket matrix coordinate real symmetric
7 7 26
1 1 1
3 1 .5
4 1 .5
5 1 .75
6 1 .5
7 1 .625
2 2 1
4 2 .5
5 2 .25
6 2 .25
7 2 .25
3 3 1.5
4 3 .25
5 3 .375
6 3 .875
7 3 .625
4 4 1.5
5 4 1
6 4 .875
7 4 .9375
5 5 1.8125
6 5 .6875
7 5 1.25
6 6 1.78125
7 6 1.234375
7 7 1.91796875

The issue is getting it to work faster for a matrix of 300k x 300k, this would take days or weeks to run as i have been running it for a while now, what can i do to make it run faster?

N.B: save the example as "anything.txt", then read the file in as "fam <- read.delim(, header = TRUE, sep="")"

Victor
  • 51
  • 6
  • Hi @Victor, you may want to put your code at the top in its own code block as it is hard to see it currently. I didn't realize it was there until my second look. – Russ Thomas Aug 01 '19 at 03:31
  • 4
    Instead of asking help to debug and optimise your code, why not explain what is your desired outcome for your matrix. – Adam Quek Aug 01 '19 at 04:00
  • Hi Victor, you can get quick performance gains changing some of the `if` to `else if`. https://www.datamentor.io/r-programming/if-else-statement/ . Long term, you need to vectorize your algorithm. – M.Viking Aug 01 '19 at 04:22
  • Hello Viking, how can i vectorize the foreach loop algorithm? – Victor Aug 01 '19 at 04:46
  • 1
    Please refer to the comment by Adam above. If you want help explain your algorithm and the expected outcome. It's unlikely that anyone will spend the effort to try and understand what your code is doing. It's very likely that your code needs to be rewritten from scratch to achieve efficiency. – Roland Aug 01 '19 at 07:27
  • Hello Roland, this matrix helps us to identify the relationship between the ID, father, mother through their generation count this means that if we know the parents, each parents should provide a random sample of one out of every two alleles (genes). One allele (gene) would be randomly sampled from the two from each parent to create a new offspring. Roland, what can you please help with about the efficiency? – Victor Aug 01 '19 at 15:13
  • 1
    @Victor, I tried looking at this but making a 4,000 X 4,000 matrix is too much. Plus I had errors when copying and pasting your code. You should target a 10 x 10 matrix and make up a data. Maybe even 30 x 30 if you're crazy. – Cole Aug 01 '19 at 23:33
  • @Cole, I have tried making a 15x15 matrix and it worked successfully, just looking for a way to make the code efficient. What errors did you encounter? – Victor Aug 02 '19 at 09:15
  • 1
    Im saying replace your current example with something with only 15 records. Or even 100. – Cole Aug 02 '19 at 10:36
  • @Cole, i have updated it with a smaller example (7) and output, this took 0.3 secs to run – Victor Aug 02 '19 at 14:24
  • What have you tried so far? - have you enhanced your function? ... I see your smaller example (7), but that data is not in an easy to use format, [can you put it so we can cut and paste](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), such as `dput(data)`? ... Why are you using `foreach()`? That function has a lot of overhead, plus `.verbose=TRUE` adds **even more** overhead, and for what you've pasted, the whole line can be replaced with `for(j in 1:(t-1))`. ... Are you measuring execution time with `microbenchmark()`? – M.Viking Aug 02 '19 at 23:01
  • @Viking, i have put it in a usable format, i am using foreach because of parallelization, i used for statement initially but it made the system freeze, think the data was too much for it to handle but with foreach, it is working but isn't fast as i'd like it to be. I read about vectorizing forloops but i haven't gotten a hang of it. do you know how to vectorize a forloop in r? – Victor Aug 02 '19 at 23:26
  • 1
    Thank you for providing a smaller number. Still, I get an error in the ```foreach(j = 1:(t-1)...``` line. ```%:% was passed an illegal right operand```. I changed it to ```%do%``` and it works. – Cole Aug 03 '19 at 00:25
  • @cole glad it worked at your end, do you have an idea of how to make it more efficient? I have read about vectorizing but I am not having a hang of that – Victor Aug 03 '19 at 02:12

1 Answers1

2

The problem you have is that this is recursive. Each loop depends on the previous loop's results. Therefore, you can't really use vectorization to solve the problem.

If you want to use R for this, you're best bet is to look into Rcpp. I'm not that good with Rcpp but I do have some suggestions.

The easiest thing to do is to get rid of the foreach loop and replace it with a regular for loop. There's a lot of overhead to use parallel threads and when a function is recursive, it's hard for the workers to really do better on their own.

# Before

foreach(j = 1:(t-1),  .combine='c', .packages=c("Matrix", "foreach")) %do%
{ ... }

# After
for (j in 1:(t-1)) {
...
}

The next thing to do is to contemplate whether you really need a sparse matrix. If you're not having memory problems, you might as well use a regular matrix.

A<-Matrix(0,nrow=n,ncol=n, sparse=TRUE)
# to
A<-matrix(0,nrow=n,ncol=n)

The last thing to do is to rethink how you initialize everything. Parts of that code gets repeated multiple times like the assignment to the diag. Since we're summing separate elements, we can initialize the diag with the part common to all 3 code snippets 2 - 0.5^(fam[t, 'GEN'] - 1).

A<-matrix(0,nrow=n,ncol=n)
diag(A) <- 2-0.5^(fam[["GEN"]]-1)

This is important because that allows us to skip ahead. Your original code snippet had like, 1,000 rows with 0s for 'mum' and 'dad'. With this initialization, we can skip right ahead to the first row with a non-zero result for 'mum' or 'dad':

  t_start <- min(which.max(fam$dad > 0), which.max(fam$mum > 0))
  t_end <- max(fam[['ID']])

  for (t in t_start:t_end) {
...
}

I decided in the interest of skipping if statements, I wanted to use sum(c(..., ...)) to sum up everything. That way, if the subset resulted in a NULL, I could still sum. Altogether:

  t_start <- min(which.max(fam$dad > 0), which.max(fam$mum > 0))
  t_end <- max(fam[['ID']])

  A<-matrix(0,nrow=t_end,ncol=t_end)
  diag(A) <- 2-0.5^(fam[["GEN"]]-1)

  for (t in t_start:t_end) {
    A[t,t]<- sum(c(A[t,t], 0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]))

    for(j in 1:(t-1))  {
      A[t,j]<- 0.5 * sum(c(A[j,fam[t,"dad"]],A[j,fam[t,"mum"]]))
      A[j,t]<- A[t,j]
    }
  }
  A

Performance

Unit: microseconds
                expr       min         lq      mean    median        uq     max neval
            original 85759.901 86650.7515 88776.695 88740.050 90529.750 97433.2   100
         non_foreach 47912.601 48528.5010 50699.867 50220.901 51782.651 88355.1   100
 non_sparse_non_each  1423.701  1454.3015  1531.833  1471.451  1496.401  4126.3   100
        final_change   953.102   981.8015  1212.264  1010.500  1026.052 21350.1   100

All code

fam <- structure(list(ID = c(1L, 2L, 3L, 4L, 6L, 5L, 7L), dad = c(0L, 
                                                                  0L, 1L, 1L, 1L, 3L, 5L), mum = c(0L, 0L, 0L, 2L, 4L, 4L, 6L), 
                      GEN = c(1L, 1L, 2L, 2L, 3L, 3L, 4L)), class = "data.frame", row.names = c(NA, 
                                                                                                -7L))
A<-matrix(0,nrow=7,ncol=7)
diag(A) <- 2-0.5^(fam[["GEN"]]-1)

t_start <- min(which.max(fam$dad > 0), which.max(fam$mum > 0))
t_end <- max(fam[['ID']])

for (t in t_start:t_end) {
  A[t,t]<- sum(c(A[t,t], 0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]))

  for(j in 1:(t-1))  {
    A[t,j]<- 0.5 * sum(c(A[j,fam[t,"dad"]],A[j,fam[t,"mum"]]))
    A[j,t]<- A[t,j]
  }
}
A

hom<-function(data) { 
  library(Matrix)
  library(foreach)
  n<-max(as.numeric(fam[,"ID"])) 
  t<-min(as.numeric(fam[,"ID"])) 
  A<-Matrix(0,nrow=n,ncol=n, sparse=TRUE)

  while(t <=n) {

    s<-max(fam[t,"dad"],fam[t,"mum"]) 
    d<-min(fam[t,"dad"],fam[t,"mum"])
    if (s>0 & d>0 ) 
    { 
      if (fam[t,"GEN"]==999 & s!=d) 
      { warning("both dad and mum should be the same, different for at least       one individual")
        NULL    
      }

      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)+0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]
      foreach(j = 1:(t-1),  .combine='c', .packages=c("Matrix", "foreach")) %do%

      { 
        A[t,j]<- 0.5*(A[j,fam[t,"dad"]]+A[j,fam[t,"mum"]])
        A[j,t]<- A[t,j] 
      } 
    } 
    if (s>0 & d==0 )
    { 
      if ( fam[t,"GEN"]==999) 
      { warning("both dad and mum should be the same, one parent equal to zero for at least individual")
        NULL }
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1) 
      foreach(j = 1:(t-1),  .combine='c', .packages=c("Matrix", "foreach")) %do%  
      { 
        A[t,j]<-0.5*A[j,s]
        A[j,t]<-A[t,j] 
      } 
    } 
    if (s==0 )
    {
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)
    }

    # cat(" MatbyGEN: ", t ,"\n") 
    t <- t+1


  } 

  A

}

hom2<-function(data) { 
  library(Matrix)
  n<-max(as.numeric(fam[,"ID"])) 
  t<-min(as.numeric(fam[,"ID"])) 
  A<-Matrix(0,nrow=n,ncol=n, sparse = T)

  while(t <=n) {

    s<-max(fam[t,"dad"],fam[t,"mum"]) 
    d<-min(fam[t,"dad"],fam[t,"mum"])
    if (s>0 & d>0 ) 
    { 
      if (fam[t,"GEN"]==999 & s!=d) 
      { warning("both dad and mum should be the same, different for at least       one individual")
        NULL    
      }

      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)+0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]
      for (j in 1:(t-1)) { 
        A[t,j]<- 0.5*(A[j,fam[t,"dad"]]+A[j,fam[t,"mum"]])
        A[j,t]<- A[t,j] 
      } 
    } 
    if (s>0 & d==0 )
    { 
      if ( fam[t,"GEN"]==999) 
      { warning("both dad and mum should be the same, one parent equal to zero for at least individual")
        NULL }
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1) 
      for (j in 1:(t-1)) { 
        A[t,j]<-0.5*A[j,s]
        A[j,t]<-A[t,j] 
      } 
    } 
    if (s==0 )
    {
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)
    }

    # cat(" MatbyGEN: ", t ,"\n") 
    t <- t+1


  } 

  A

}

hom3<-function(data) { 
  n<-max(as.numeric(fam[,"ID"])) 
  t<-min(as.numeric(fam[,"ID"])) 
  A<-matrix(0,nrow=n,ncol=n)

  while(t <=n) {

    s<-max(fam[t,"dad"],fam[t,"mum"]) 
    d<-min(fam[t,"dad"],fam[t,"mum"])
    if (s>0 & d>0 ) 
    { 
      if (fam[t,"GEN"]==999 & s!=d) 
      { warning("both dad and mum should be the same, different for at least       one individual")
        NULL    
      }

      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)+0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]
      for (j in 1:(t-1)) { 
        A[t,j]<- 0.5*(A[j,fam[t,"dad"]]+A[j,fam[t,"mum"]])
        A[j,t]<- A[t,j] 
      } 
    } 
    if (s>0 & d==0 )
    { 
      if ( fam[t,"GEN"]==999) 
      { warning("both dad and mum should be the same, one parent equal to zero for at least individual")
        NULL }
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1) 
      for (j in 1:(t-1)) { 
        A[t,j]<-0.5*A[j,s]
        A[j,t]<-A[t,j] 
      } 
    } 
    if (s==0 )
    {
      A[t,t]<- 2-0.5^(fam[t,"GEN"]-1)
    }

    # cat(" MatbyGEN: ", t ,"\n") 
    t <- t+1


  } 

  A

}

library(microbenchmark)
f_changed = function(fam) {
  t_start <- min(which.max(fam$dad > 0), which.max(fam$mum > 0))
  t_end <- max(fam[['ID']])

  A<-matrix(0,nrow=t_end,ncol=t_end)
  diag(A) <- 2-0.5^(fam[["GEN"]]-1)

  for (t in t_start:t_end) {
    A[t,t]<- sum(c(A[t,t], 0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]))

    for(j in 1:(t-1))  {
      A[t,j]<- 0.5 * sum(c(A[j,fam[t,"dad"]],A[j,fam[t,"mum"]]))
      A[j,t]<- A[t,j]
    }
  }
  A
}
microbenchmark(
  original = {
    hom(fam)
  }
  , non_foreach = {
    hom2(fam)
  }
  , non_sparse_non_each = {
    hom3(fam)
  }
  , final_change = {
  f_changed(fam)
  }
,times = 100
)
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Thanks for your help so far @Cole, I have started to run it on a bigger data, would tell you how many mins/hours it took – Victor Aug 03 '19 at 12:50
  • thanks for your help so far but i encountered this error when i tried with a bigger data : Error in `diag<-`(`*tmp*`, value = A[2 - 0.5^(fam[["GEN"]] - 1)]) : replacement diagonal has wrong length – Victor Aug 03 '19 at 17:09
  • That code doesn't look familiar. Mine was ```diag(A)<-2 - ...``` there should be no brackets by A – Cole Aug 03 '19 at 17:23
  • Error in `diag<-`(`*tmp*`, value = 2 - 0.5^(fam[["GEN"]] - 1)) : replacement diagonal has wrong length. I still encounter the same error – Victor Aug 04 '19 at 02:52
  • There's not much I can do. Did you adjust the ```A``` matrix to be something other than a 7x7 matrix? What is the nrow of fam vs. the dimensions of A? – Cole Aug 04 '19 at 03:02
  • Yes, I tried it with my 300,000 dataset and made it sparse – Victor Aug 04 '19 at 04:06
  • Do ```nrow(A)``` and ```nrow(fam)```. They should give the same number – Cole Aug 04 '19 at 11:12
  • @Cole, nice optimization! --- The code `A[j,t]<- A[t,j]` is filling in the matrix's upper triangle, you could remove that line, and [copy the completed lower triangle to the upper triangle](https://stackoverflow.com/questions/26166569/copy-upper-triangle-to-lower-triangle-for-several-matrices-in-a-list#comment43492071_26166853) outside the loop with `A[upper.tri(A)] <- t(A)[upper.tri(A)]`. – M.Viking Aug 05 '19 at 13:36
  • @Cole, after executing the microbenchmark, you can use `all.equal(original, non_foreach, final_change)` to ensure the functions output same results. – M.Viking Aug 05 '19 at 13:45
  • I thought about the ```upper.tri``` but since the function is recursive and depends on itself, I believe I need to update ```A[j, t] <- A[t,j]``` during each loop. Still, I don't think that assignment would be much costlier than doing the ```upper.tri``` assignment - the function already has the loop. I agree, I should get in a better habit of ```all.equal```. – Cole Aug 06 '19 at 03:43
  • @Victor, did it work? Also, I think you shouldn't use a Sparse Matrix. By the end of the loop it becomes unsparse so there may be memory allocation issues with using it because your matrix is constantly requesting more memory. – Cole Aug 08 '19 at 10:23
  • @Cole, I tried it with a bigger dataset (sparsed and not-sparsed), ran into memory allotment issue, don’t know what else to do – Victor Aug 08 '19 at 12:26
  • @Cole, can we vectorize the for loops? Is it possible? – Victor Aug 08 '19 at 13:08
  • No Vectorization is not possible because of recursion. But looking back, 300,000 x 300,000 is 90 billion data elements. Each numeric element takes up 8 bytes so you'd need at least 720 GB of RAM. – Cole Aug 08 '19 at 22:49
  • getting that much ram would be a tad difficult, what do you think could be done instead? – Victor Aug 09 '19 at 01:34
  • No idea. You can search for big data or look at a different language. But this is beyond what I know. – Cole Aug 09 '19 at 10:15
  • Cole, is it possible to create this matrix using data.table? – Victor Aug 11 '19 at 06:13
  • The memory problem will still exist. You may have better luck asking another question about creating large matrices. – Cole Aug 11 '19 at 10:19