I have a very large binary matrix (100 rows by 5 million columns), stored as a big.matrix to conserve memory; as an ordinary matrix it takes over 2 GB. A smaller example:

library(bigmemory)

r <- 100
c <- 10000
m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c)
m4 <- cbind(m4, 1)  # append a constant column that should be removed
m4 <- as.big.matrix(m4)

I need to remove every column which has only one unique value (in this case, only 0s or only 1s). Because of the number of columns, I want to be able to do this in parallel.

How can I accomplish this while keeping the data stored as a big.matrix? I can convert it to a data frame and loop over the columns, counting the unique values in each, but this takes too much RAM.

Thanks!

Keshav M
  • Something like this ok? `m5 <- m4[, !(colSums(m4) %in% c(0, nrow(m4)))]` –  Apr 21 '18 at 20:11
  • Unfortunately not. First of all, it only works when m4 is of class `matrix`, not `big.matrix` as specified in the problem. Additionally, the output is of class `matrix` also, taking up too much memory. – Keshav M Apr 21 '18 at 20:15
  • Ah, ok, apologies. How about either wrapping the subset in `as.big.matrix`, or applying the subset operation over subsets of the big matrix using `sub.big.matrix`? I guess you maybe already considered those options? –  Apr 21 '18 at 20:33
  • Use Rcpp to write an algorithm that returns the column indices you want to keep based on your criterion. This should be easy if you already know how to access elements of a `big.matrix` in Rcpp. Then, use `deepcopy`. – F. Privé Apr 21 '18 at 21:03

1 Answer

Put this in a .cpp file and source it with `Rcpp::sourceCpp`:

// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
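LogicalVector to_keep(SEXP bm_addr) {

  // Assumes the big.matrix stores doubles, as in the example below
  // (cbind(m4, 1) coerces the integer matrix to double).
  XPtr<BigMatrix> xptr(bm_addr);
  MatrixAccessor<double> macc(*xptr);

  size_t n = macc.nrow();
  size_t m = macc.ncol();

  double first_val;

  // keep[j] becomes true once column j is known to hold two distinct values.
  LogicalVector keep(m, false);

  for (size_t j = 0; j < m; j++) {
    first_val = macc[j][0];  // macc[j] points to column j
    for (size_t i = 1; i < n; i++) {
      if (macc[j][i] != first_val) {
        keep[j] = true;
        break;  // a second value was found; skip the rest of the column
      }
    }
  }

  return keep;
}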

/*** R
library(bigmemory)
r <- 100
c <- 10000
m4 <- matrix(sample(0:1,r*c, replace=TRUE),r,c)
m4 <- cbind(m4, 1)
m4 <- as.big.matrix(m4)
m4[, 1] <- 1
m4[, 2] <- 0

keep <- to_keep(m4@address)
m4.keep <- deepcopy(m4, cols = which(keep))
*/
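
The loop above is serial, while the question asks for something that can run in parallel. Since each column is checked independently, the outer loop is a natural candidate for OpenMP. Below is a minimal standalone sketch of that idea, not a drop-in Rcpp function: `cols_to_keep` is a hypothetical name, and a plain `std::vector<double>` in column-major order stands in for the `MatrixAccessor` (an assumption about the storage layout and element type).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of bigmemory): flags columns that contain at
// least two distinct values. `data` is assumed to hold an nrow x ncol matrix
// of doubles in column-major order, so column j starts at data[j * nrow].
std::vector<int> cols_to_keep(const std::vector<double>& data,
                              std::size_t nrow, std::size_t ncol) {
  assert(data.size() == nrow * ncol);

  // int rather than bool: std::vector<bool> packs bits, so concurrent writes
  // to neighbouring elements could race.
  std::vector<int> keep(ncol, 0);
  if (nrow == 0) return keep;  // no rows, nothing to compare

  const long long m = static_cast<long long>(ncol);

  // Each iteration reads only column j and writes only keep[j], so the
  // iterations are independent. Without OpenMP the loop simply runs serially.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
  for (long long j = 0; j < m; j++) {
    const std::size_t jj = static_cast<std::size_t>(j);
    const double* col = data.data() + jj * nrow;
    const double first_val = col[0];
    for (std::size_t i = 1; i < nrow; i++) {
      if (col[i] != first_val) {
        keep[jj] = 1;
        break;  // exits the inner loop only
      }
    }
  }
  return keep;
}
```

To try the same thing inside the Rcpp file, one option would be to add `// [[Rcpp::plugins(openmp)]]` and put the pragma on the outer loop of `to_keep`, writing results through a plain pointer rather than through Rcpp objects inside the parallel region. Whether this helps in practice probably depends on memory bandwidth, since the scan does very little work per element.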
F. Privé