5

Several matrices with 2 columns each need to be combined as shown below:

matrix1
1,3
1,5
3,6

matrix2
1,4
1,5
3,6
3,7

output
1,3,1
1,4,1
1,5,2
3,6,2
3,7,1

The third column in the output is the count of how many times a pair has been seen across all the matrices. I wrote some code to do this:

require(data.table)

set.seed(1000)
# 200 matrices with pairs drawn from 1:1000; duplicate rows within each matrix are removed
data.lst <- lapply(1:200, function(n) { x <- matrix(sample(1:1000,2000,replace=T), ncol=2); x[!duplicated(x),] })

#method 1: stack the new pairs onto the running table, then re-aggregate the counts
pair1.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
    pair1.dt <- rbind(pair1.dt, data.table(i=mat[,1],j=mat[,2],cnt=1))[, .(cnt=sum(cnt)), .(i,j)]
}

#method 2: full outer merge on (i,j), then sum the two count columns
pair2.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
    pair2.dt <- merge(pair2.dt, data.table(i=mat[,1],j=mat[,2],cnt=1), by=c("i","j"), all=T)[, 
        cnt:=rowSums(.SD,na.rm=T), .SDcols=c("cnt.x","cnt.y")][, c("cnt.x","cnt.y"):=NULL]
}

cat(sprintf("num.rows  =>  pair1: %d,  pair2: %d", pair1.dt[,.N], pair2.dt[,.N]), "\n")

In the real problem, each of the matrices has tens of millions of rows and there may be 30-40% overlap between them. I am trying to figure out the fastest way to do this. I tried using `Matrix::sparseMatrix`; while that is much faster, I ran into the error "long vectors not supported yet". I have a couple of different data.table-based approaches above, and I am looking for suggestions to speed up this code and/or alternative approaches.
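Roughly, the sparseMatrix attempt looked like the sketch below (not my exact code, and shown here with the toy range; in the real problem the dimensions are 2e6 x 2e6, which is where the long-vector error shows up):

library(Matrix)

# sketch: accumulate pair counts in a sparse matrix, then read them back out
dims <- c(1000, 1000)
cnt.mat <- sparseMatrix(i=data.lst[[1]][,1], j=data.lst[[1]][,2], x=1L, dims=dims)
for(mat in data.lst[-1]) {
    cnt.mat <- cnt.mat + sparseMatrix(i=mat[,1], j=mat[,2], x=1L, dims=dims)
}
out <- summary(cnt.mat)   # data.frame with columns i, j, x (x = count for that pair)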

ironv
  • Take a look at `dplyr` package's `*_join` functions. – Gopala Apr 18 '16 at 19:26
  • For one thing, don't create `cnt` and sum it. Instead, try `rbindlist(lapply(data.lst, as.data.table))[, .N, by=V1:V2]` or similar. – Frank Apr 18 '16 at 19:26
  • I don't think iteration is needed here, but it's not obvious to me what you're trying to achieve, so I could be wrong. – Frank Apr 18 '16 at 19:30
  • The reason I have iteration here is that in the __real__ problem each matrix has tens of millions of rows, so I cannot keep all those matrices. I need to update `output` each time a `matrix` is available. – ironv Apr 18 '16 at 19:32
  • @Frank in your suggestion about `rbindlist`, is the FULL data.table created before the aggregation is applied? If so, that will be too big and I will be out of memory. – ironv Apr 18 '16 at 19:35
  • Yes, it is, so I get why iteration is necessary. – Frank Apr 18 '16 at 19:36
  • Do you know ahead of time what the range of pairs is, e.g., all pairs (x,y) both in 1:1000? – Frank Apr 18 '16 at 19:37
  • I do. It is 1:2million for both. The data is VERY VERY VERY sparse. – ironv Apr 18 '16 at 19:40
  • @ironv I'm a little confused - if you actually have a list of those matrices, then you *can* store all of that in memory and just do the `rbindlist` as Frank suggested..? – eddi Apr 18 '16 at 20:14
  • I had to do that only for the purpose of creating an example for posting here. In the real problem, I have a for loop where I load an RData file, create the `matrix` with over 80 million records, update `output` and continue to the next iteration. – ironv Apr 18 '16 at 20:18
  • If you "create" the matrix, maybe you should instead create it as a data.table, since that coercion is likely not cheap. – Frank Apr 18 '16 at 20:25
  • You always have 2 "integer" columns and `max` in each column is, at most, 2e6? Are there rows of a single "matrix" that are the same or is each row in each "matrix" distinct? – alexis_laz Apr 18 '16 at 20:37
  • @alexis_laz Yes. Each row in a matrix is distinct. – ironv Apr 18 '16 at 20:51
  • 1
    Not sure how feasible it actually is but you could try something along the following: (1) build a `Matrix::Matrix` with dimensions `max of all matrices * max of all matrices`: `n = 0L; for(mat in data.lst) n = max(n, max(mat)); m = Matrix(0L, n, n)` and (2) tabulate your pairs inside that "Matrix": `for(mat in data.lst) m[mat] = m[mat] + 1L`. Then, `summary(m)` should give the occurences of each combination. Of course, if in your matrices appear all 2e6 * 2e6 combinations, then you'll run out of memory. – alexis_laz Apr 18 '16 at 21:01
  • @alexis_laz I have already tried this (please see the last paragraph in my original post). I was using Matrix::sparseMatrix and it was fast, but then I ran into this issue: https://github.com/RcppCore/Rcpp/issues/337 (I mentioned the error in the OP). – ironv Apr 20 '16 at 01:11

3 Answers

5

First, make data.tables of them:

dt.lst = lapply(data.lst, as.data.table)

Stacking. For comparison, here's the fast way that involves stacking:

res0 = rbindlist(dt.lst)[, .(n = .N), by=V1:V2]

The OP has said this is not feasible, since the intermediate result made by rbindlist would be too large.

Enumerating first. With a small range of values, I'd suggest enumerating them all up front:

res1 = CJ(V1 = 1:1000, V2 = 1:1000)[, n := 0L]
for (k in seq_along(dt.lst)) res1[ dt.lst[[k]], n := n + .N, by=.EACHI ] 

fsetequal(res0, res1[n>0]) # TRUE

The OP has indicated that values in each column range up to 2e6, giving on the order of 4e12 possible pairs, so enumerating them all doesn't seem like a good idea. Instead, we can use

res2 = dt.lst[[1L]][0L]
for (k in seq_along(dt.lst)) res2 = funion(res2, dt.lst[[k]])
res2[, n := 0L]
setkey(res2, V1, V2)
for (k in seq_along(dt.lst)) res2[ dt.lst[[k]], n := n + .N, by=.EACHI ]     

fsetequal(res0, res2) # TRUE

This is the slowest of the three options for the example given, but seems best to me in light of the OP's concerns.

Growing inside a loop. Finally...

res3 = dt.lst[[1L]][0L][, n := NA_integer_][]
for (k in seq_along(dt.lst)){
  setkey(res3, V1, V2)
  res3[dt.lst[[k]], n := n + .N, by=.EACHI ]
  res3 = rbind(
    res3, 
    fsetdiff(dt.lst[[k]], res3[, !"n", with=FALSE], all=TRUE)[, .(n = .N), by=V1:V2]
  )
} 

fsetequal(res0, res3) # TRUE

Growing objects inside a loop is strongly discouraged and inefficient in R, but this allows you to do it in a single loop instead of two.

Other options and notes. I suspect that you'd be best off using a hash. Those are available in the hash package and probably also through the Rcpp package.
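For illustration only, here's a rough sketch of the idea with a plain R environment (environments in R are hash tables; the hash package offers a friendlier interface on top of the same idea). An R-level loop like this will be slow, so in practice the hashing would want to live in compiled code:

env = new.env(hash = TRUE)
for (mat in data.lst) {
  keys = paste(mat[, 1], mat[, 2], sep = "_")
  for (key in keys) {
    cur = env[[key]]
    env[[key]] = if (is.null(cur)) 1L else cur + 1L
  }
}
keys = ls(env)
res.hash = data.table(key = keys, n = unlist(mget(keys, envir = env)))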

fsetequal, fsetdiff and funion are recent additions to the development version of the package. Find details on the data.table project's official site.

By the way, if entries within each matrix are distinct, you can replace .N with 1L everywhere above and drop by=.EACHI and all=TRUE.
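For instance, under that assumption the counting loop for res2 above reduces to a plain join update (sketch):

for (k in seq_along(dt.lst)) res2[ dt.lst[[k]], n := n + 1L ]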

Frank
  • `fsetequal` takes nrow into account already, comparing sets duplicates-wise. – jangorecki Apr 18 '16 at 20:39
  • @Frank I implemented your suggestion "Growing inside a loop" and the changes since the matrix entries are unique (last line in your post). In the real problem, by the 20th iteration (i.e. using 20th entry in dt.lst) it takes 159s for setkey, 6s to increment n, 240s for fsetdiff and 80s for the rbind. I was wondering about your suggestion regarding hash, would that be used with data.table? – ironv Apr 20 '16 at 04:53
  • @ironv It wouldn't necessarily need to involve data.table. I think Scott's answer does what I had in mind. (I'm not familiar enough with Rcpp to say for sure.) If you tried the two-loop way (with `res2` in this answer), I wonder how that went. – Frank Apr 20 '16 at 05:42
3

Using Rcpp. This method will take advantage of the hashing property of std::unordered_map.

#include "Rcpp.h"
#include <stdint.h>
#include <unordered_map>

using namespace std;
using namespace Rcpp;

//[[Rcpp::export]]
Rcpp::XPtr<int> CreateMap(){
  std::unordered_map<int64_t, int>* myMap = new std::unordered_map<int64_t, int>();
  Rcpp::XPtr<int> p((int*)myMap,false);
  return p;
}

//[[Rcpp::export]]
void FreeMap(Rcpp::XPtr<int> map_ptr){
  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  delete myMap;
}

//[[Rcpp::export]]
void AccumulateValues(Rcpp::XPtr<int> map_ptr, SEXP mat){

  NumericMatrix m(mat);

  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  for(int i = 0; i<m.nrow(); i++){
    int c1 = m(i, 0);
    int c2 = m(i, 1);
    // pack the pair into a single 64-bit key: upper 32 bits hold c1, lower 32 hold c2
    int64_t key = ((int64_t)c1 << 32) + c2;
    (*myMap)[key] ++;

  }
}
//[[Rcpp::export]]
SEXP AsMatrix(Rcpp::XPtr<int> map_ptr){
  std::unordered_map<int64_t, int>* myMap =  (std::unordered_map<int64_t, int>*)(int*)map_ptr;
  NumericMatrix m(myMap->size(),3);
  int index = 0;
  for ( auto it = myMap->begin(); it != myMap->end(); ++it ){
    int64_t key = it->first;
    // unpack the 64-bit key back into the original pair
    m(index, 0) = (int)(key >> 32);
    m(index, 1) = (int)key;
    m(index, 2) = it->second;
    index++;
  }
  return m;
}

The R code is then:

myMap <- CreateMap()
AccumulateValues(myMap, matrix1)
AccumulateValues(myMap, matrix2)
result <- AsMatrix(myMap)
FreeMap(myMap)

It also requires

PKG_CXXFLAGS = "-std=c++0x"

in the package's Makevars file.
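If you just want to compile and try the functions without building a package, a rough sketch of the inline route (the .cpp file name here is made up):

library(Rcpp)
Sys.setenv("PKG_CXXFLAGS" = "-std=c++0x")  # enables std::unordered_map
sourceCpp("pair_counter.cpp")              # file containing the C++ code above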

Scott Morken
  • Thanks for your post. I have never used Rcpp and really do not know how to go about using the above code. Will look at examples and try this out. – ironv Apr 19 '16 at 23:08
  • The steps I would do (there are probably lots of ways) would be to 1. make sure you have R tools installed and configured correctly 2. install the Rcpp package 3. create a new project through RStudio, specifying new "project with Rcpp" in the dialogs 4. in the project directory you now have a "src" dir with a "hello world" example, put the C++ source code you want to use there 5. since the above code uses unordered_map you need to support it with a compiler flag, so add a file named Makevars to the "src" dir with the PKG_CXXFLAGS = "-std=c++0x" line added to it. – Scott Morken Apr 20 '16 at 16:32
  • Thanks for your post. I had just gotten started with Rcpp and tried the fibonacci example from Dirk's website. Amazing speedup! I am running this on a Linux box where I do not have root privs. I have compiled R and installed Rcpp. Do I add the Makevars file when compiling the package or just for the above code? – ironv Apr 20 '16 at 17:01
  • Since the steps I outlined are a package-based approach (i.e. you are creating and compiling your own "temp" R package that is dedicated just to this code), you would just need to make sure that the Makevars file exists in your package's src dir when you compile your package (or any package, for that matter, where newer C++ features are used). If you use the inline Rcpp features rather than the package approach, see this post: http://stackoverflow.com/questions/7063265/how-to-set-g-compiler-flags-using-rcpp-and-inline – Scott Morken Apr 20 '16 at 18:03
  • In an R console I used `Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")` and then `sourceCpp` to compile and use the code. Each matrix had 80 million rows and it took ~25s to `Accumulate`. Thank you for your help. Your answer not only helped me with this issue, but also helped me learn a very useful package! – ironv Apr 20 '16 at 22:15
2

Perhaps you can process your data in batches, as your memory allows:

maxRows = 5000 # approximately how many rows can you hold in memory
tmp.lst = list()
nrows = 0
idx = 1
for (i in seq_along(data.lst)) {
  tmp.lst[[idx]] = as.data.table(data.lst[[i]])[, cnt := 1]
  idx = idx + 1
  nrows = nrows + nrow(data.lst[[i]])

  # if too many rows, collapse (can also replace by some memory condition)
  if (nrows > maxRows) {
    tmp.lst = list(rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2])
    idx = 2
    nrows = nrow(tmp.lst[[1]])
  }
}

#final collapse
res = rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2]
eddi