0

I am trying to subtract each column from each other column in a large R data.table, that has 13125 columns and 90 rows.

I am following up on a previous question which addresses this for data.tables of smaller dimensions (Subtract every column from each other column in a R data.table).

My problem is that I am currently running out of memory to generate the resulting data.table of column combinations (which seems to require 59.0GB).

My question is, is there a more memory-efficient way to calculate the column differences with combn or perhaps another function for larger datasets?

The code I have been using is:

# I have a data.table of 13125 columns and 90 rows, called data. 

# use combn to generate all possible pairwise column combinations (column + column),
# then within this apply a function to subtract the column value from its paired column value.
# this is done for each row, to produce a new datatable called res.

res <- as.data.table(combn(colnames(data), 2, function(x) data[[x[1]]] - data[[x[2]]]))

# take the pairwise column combinations and paste the pairing as the new column name

colnames(res) <- combn(colnames(data), 2, paste, collapse="_")

I apologise if this question is too similar and therefore considered a duplication. I would be very grateful for any advice with how to improve the efficiency of this code for the scale of my data.

fields
  • 1
  • 4
    You are asking for A LOT (about 86126250) of pairwise comparisons. Are you sure this is the right thing to do? – Roman Luštrik Feb 15 '19 at 15:57
  • The number of pairwise combinations is defined as `n!/(k!(n-k)!)`. In this example, R cannot even calculate that, since `factorial(13125)` errors with `value out of range in 'gammafn'`. So I second (strongly) @RomanLuštrik's assertion (I had to go to a [bit-number combinatorial calculator](https://www.numberempire.com/combinatorialcalculator.php) to confirm his number). Are you certain this is really what you need? Is there any way to reduce the pairings, heuristically, so that you still find the "significance" you're seeking? – r2evans Feb 15 '19 at 16:49
  • @r2evans I just did `dim(combn(1:13125, 2))` and waited a bit. – Roman Luštrik Feb 16 '19 at 19:26
  • @r2evans and Roman Luštrik, thank you very much for your responses. Unfortunately, I do require this number of comparisons. For more context, my dataset contains gene expression data (13125 genes as columns, and 90 samples as rows). My motivation is to calculate pairwise gene expression deviance across all samples. My dataset also contains missing values and I haven't been able to find a package that can handle this, and so I am hoping to calculate it manually. My pipeline is to then square these comparison values, and sum the column totals to calculate deviance between gene pairs. – fields Feb 18 '19 at 09:40

1 Answers1

0

As per OP's comment regarding the next step after differencing columns, it will be more memory compact if you also square and sum the column totals during the calculation so that you will only have a vector with 13,125 elements as a result rather than storing 13,125*90*90 numeric subtracted values. A fast and possible approach is to use RcppArmadillo:

colpairs.cpp (by no means the only implementation):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
rowvec colpairs(mat Z) {
    unsigned int i, j, k = Z.n_cols;
    colvec vi, vj, y;
    rowvec res(k);

    for (i=0; i<k; i++) {
        vi = Z.col(i);
        res[i] = 0;
        for (j=0; j<k; j++) {
            vj = Z.col(j);
            y = vi - vj;
            res[i] += as_scalar(y.t() * y);
        }
    }

    return res;
}

In R:

library(Rcpp)
library(RcppArmadillo)
sourceCpp("colpairs.cpp")

# #use a small matrix to check results
# set.seed(0L)
# nc <- 3; nr <- 3; M <- matrix(rnorm(nr*nc), ncol=nc)
# c(sum((M[,1]-M[,2])^2 + (M[,1]-M[,3])^2), sum((M[,3]-M[,2])^2 + (M[,2]-M[,3])^2), sum((M[,3]-M[,1])^2 + (M[,2]-M[,3])^2))
# colpairs(M)

set.seed(0L)
nc <- 13125
nr <- 90
M <- matrix(rnorm(nr*nc), ncol=nc)
colpairs(M)

trunc. output:

[1] 2105845 2303591 2480945 2052415 2743199 2475948 2195874 2122436 2317515  .....
chinsoon12
  • 25,005
  • 4
  • 25
  • 35