3

I understand that I can extract submatrices from an already created matrix but I want to be able to create submatrices first then combine the created submatrices to form a bigger matrix to save space and time. For example in my example, I want to be able to create a submatrix for IDs with NAs (1-10) and IDs without NAs(11-20) then combine the two matrices together to form a bigger matrix but I am not getting it, would like if someone can suggest what should be in my codes given that I will make same calculations for both with NAs and without NAs.

P.S: I also want to be able to save these submatrices separately before merging them together to a singular matrix (20x20)

dorm<-function(data)
{ 
  library(Matrix)
  n<-max(as.numeric(fam[,"ID"])) 
  t<-min(as.numeric(fam[,"ID"])) 
  A <- sparseMatrix(i = n, j=n, x=n)
  while(t <=n) {

    for( t in t:n ){

      s <- max(fam[t,"dad"],fam[t,"mum"]) 
      d <- min(fam[t,"dad"],fam[t,"mum"])

      if( !is.na(s) ){ 
        if( !is.na(d) ){
          A[t,t] = 2-0.5^(fam[t,"GEN"]-1)+0.5^(fam[t,"GEN"])*A[fam[t,"dad"],fam[t,"mum"]]
          tmp = 0.5 * (A[1:(t-1),s] + A[1:(t-1),d])
          A[t, 1:(t-1)] = tmp
          A[1:(t-1), t] = tmp
        } else {
          A[t,t] = 2-0.5^(fam[t,"GEN"]-1)
          tmp = 0.5 * A[1:(t-1),s]
          A[t, 1:(t-1)] = tmp
          A[1:(t-1), t] = tmp
        }
      } else {
        A[t,t] = 2-0.5^(fam[t,"GEN"]-1)
      }
      message(" MatbyGEN: ", t)
    }

    return(A)
  }
}

fam <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 14L, 18L, 15L, 16L, 17L, 20L, 19L), dad = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 4L, 6L, 4L, 10L, 
12L, 13L, 13L, 14L), mum = c(NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 2L, 3L, 2L, 5L, 11L, 11L, 5L, 3L, 7L, 2L), GEN = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L)), class = "data.frame", row.names = c(NA, -20L))

A <- dorm(fam)
Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Viktor
  • 47
  • 5
  • If you are using sparse matrices, wouldn't it be simpler to create the entire matrix in one go? It requires only the values and their positions in the matrix. – Oliver Sep 29 '19 at 20:21
  • Yes, that was my idea but for a much larger data (>400k), it's been running for 2 weeks now which is quite slow which is why I am asking for this and how do I got about the values and their positions? – Viktor Sep 29 '19 at 20:33
  • This sounds like a problem in the code for generating the sparse matrix (how are you doing it?), more than something that would be fixed using smaller sub matrixes. Assuming your data has 1M variables i assume the matrix at the very least has 1 none-zero value in every row and column, this should not take much more than a few seconds to create, using the Matrix package. – Oliver Sep 29 '19 at 21:03
  • example: `n <- 1e6;d <- rnorm(n);r <- seq(n);c <- sample(r);system.time(mm <- sparseMatrix(i = r, j = c, x = d))`. takes roughly 0.24 seconsd on my little laptop. – Oliver Sep 29 '19 at 22:41
  • This seemed familiar. I think using sparse matrices ends up hurting - all of your ```NA``` results turn out to be non-sparse. See also: https://stackoverflow.com/questions/57301390/how-to-make-r-foreach-loops-efficient/57335242#57335242 – Cole Sep 29 '19 at 23:40
  • @Olivier, i think why it's slow is because of the non-zeros and it has to calculate for each id (according to the codes) which is why i had the idea of calculating for with/without NAs then combining them together at the end – Viktor Sep 30 '19 at 04:52
  • @Cole, it isn't the same because that linked page will make the R crash, it isn't time friendly nor memory efficient – Viktor Sep 30 '19 at 04:53
  • You may want to look into `bigmemory` matrices (https://cran.r-project.org/web/packages/bigmemory/index.html) if you are crashing / running out of memory trying to generate large matrices. – Richard J. Acton Sep 30 '19 at 17:55
  • it's not about running out of memory now, it's about time/efficiency in creating the matrix which is why i thought of asking on how to create submatrices then bind it together to make a bigger matrix at the end – Viktor Sep 30 '19 at 20:06
  • I don't think you can do the sub matrices first - subsequent results depend on previous results in the larger matrix. You may be able to just use the ```lower.tri``` and ```diag``` which would reduce the memory requirements by 50%. If you do another question with the original 4000 row example, I'd be interested in testing out ```Rcpp```. – Cole Oct 01 '19 at 00:15
  • Hello Cole, this is a link to a bigger data set: https://ufile.io/irrolcfo thanks – Viktor Oct 01 '19 at 16:00

2 Answers2

0

Here is an solution. It is ~50x faster on the large dataset (1 second vs. 50 seconds):

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
sp_mat rcpp_dorm_sp(IntegerVector ID, IntegerVector dad, IntegerVector mum, IntegerVector gen){
  int n; 
  int s; int d;

  double tmp;

  sp_mat A(dad.size(), dad.size());

  A.diag().ones();
  n = max(ID); 

  for(int t = 0; t < n; t++){
    s = std::max(dad[t], mum[t]); 
    d = std::min(dad[t], mum[t]);

    A(t,t) = 2-pow(0.5, gen[t] - 1);

    if ((s>0) & (d>0) ) { 
      A(t,t) +=  pow(0.5, gen[t])*A(dad[t]-1,mum[t]-1);
      for(int j = 0; j < t; j++){

        tmp = 0.5 * (A(j, dad[t]-1) + A(j, mum[t]-1));
        if (tmp > 0){
          A(t,j) = tmp;
          A(j,t) = tmp;
        }
      }
    } else if ((s>0) & (d==0)) {

      for(int j = 0; j < t; j++){
        tmp = 0.5 * A(j, s-1);
        if (tmp > 0){
          A(t,j) = tmp;
          A(j,t) = tmp;
        }
      }
    }
  }

  return(A);
}

And the R portion:

fam_mid <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
                                         11L, 12L, 13L, 14L, 18L, 15L, 16L, 17L, 20L, 19L),
                                  dad = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 4L, 6L, 4L, 10L, 
                                          12L, 13L, 13L, 14L),
                                  mum = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 3L, 2L, 5L, 11L, 11L, 5L, 3L, 7L, 2L)
                                  , GEN = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
                                            3L, 3L, 3L)), class = "data.frame", row.names = c(NA, -20L))

rcpp_dorm_sp(fam_cpp$ID, fam_cpp$dad, fam_cpp$mum, fam_cpp$GEN)

Cole
  • 11,130
  • 1
  • 9
  • 24
  • Thanks so far, Cole! I am trying to make it sparse, still running the codes on a much larger data: replaced ```rowSums``` with ```rowsum``` and will update you – Viktor Oct 02 '19 at 16:14
  • @Viktor - see edit. I translated your original solution into ```Rcpp```. The output is a sparse matrix and calculates your larger dataset in 1 second on my computer. – Cole Oct 16 '19 at 02:17
  • how are you doing? Thanks!!! This is beautiful and I don't know if I can get your name to add to my list of acknowledgment? – Viktor Oct 16 '19 at 15:46
  • @Viktor I'm doing great. Cole Miller is my full name and I'd very much appreciate an acknowledgement – Cole Oct 16 '19 at 22:47
  • 1
    I will send a copy of my acknowledgement to you – Viktor Oct 17 '19 at 02:21
0

To make Cole's written function sparse, I had to fix it using A[t, vec]<- 0.5 * Matrix::rowSums(cbind(A[vec,fam[t,"dad"]],A[vec,fam[t,"mum"]]), na.rm=T)

Thanks so far, we couldn't create submatrices but think we did better than that

Viktor
  • 47
  • 5