Several matrices with 2 columns each need to be combined as shown below
matrix1
1,3
1,5
3,6
matrix2
1,4
1,5
3,6
3,7
output
1,3,1
1,4,1
1,5,2
3,6,2
3,7,1
The third column in the output is the count of how many times a pair has been seen in all the matrices. I wrote some code to do this
require(data.table)
set.seed(1000)
data.lst <- lapply(1:200, function(n) { x <- matrix(sample(1:1000,2000,replace=T), ncol=2); x[!duplicated(x),] })
#method 1
pair1.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
pair1.dt <- rbind(pair1.dt, data.table(i=mat[,1],j=mat[,2],cnt=1))[, .(cnt=sum(cnt)), .(i,j)]
}
#method 2
pair2.dt <- data.table(i=integer(0), j=integer(0), cnt=integer(0))
for(mat in data.lst) {
pair2.dt <- merge(pair2.dt, data.table(i=mat[,1],j=mat[,2],cnt=1), by=c("i","j"), all=T)[,
cnt:=rowSums(.SD,na.rm=T), .SDcols=c("cnt.x","cnt.y")][, c("cnt.x","cnt.y"):=NULL]
}
cat(sprintf("num.rows => pair1: %d, pair2: %d", pair1.dt[,.N], pair2.dt[,.N]), "\n")
In the real problem, each of the matrices have 10s of millions of rows and there may be 30-40% overlap. I am trying to figure out the fastest way to do this. I tried using Matrix::sparseMatrix. While that is much faster, I ran into an error "long vectors not supported yet". I have a couple of different data.table based approaches here. I am looking for suggestions to speed up this code and/or suggest alternative approaches.