How to efficiently work with a sparse / "long format" data matrix in R

Question

EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the related functions are

Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])

or, if the data is already an ordinary (wide) matrix, simply

Data <- Matrix(data = Data, sparse = TRUE)

Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
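
For anyone landing here later, this is a minimal sketch of how the cosine similarity could be computed once the data is in a sparse matrix. It assumes the row and column IDs are positive integers and that every row has at least one non-zero value; the names long, M, dots, norms and cos.sim are just illustrative.

```r
library(Matrix)

# Toy long-format data: row id, column id, value
# (same values as the example data in the question)
long <- data.frame(
  V1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  V2 = c(1, 2, 3, 1, 2, 4, 1, 4, 5),
  V3 = c(1, 2, 2, 1, 1, 1, 1, 3, 1)
)

# Build the sparse matrix; only the non-zero entries are stored
M <- sparseMatrix(i = long$V1, j = long$V2, x = long$V3)

# All pairwise row dot products in one sparse operation, i.e. M %*% t(M)
dots <- tcrossprod(M)

# Euclidean norm of each row
norms <- sqrt(rowSums(M^2))

# Cosine similarity: entry (i, j) divided by norms[i] * norms[j]
cos.sim <- as.matrix(dots) / outer(norms, norms)
print(round(cos.sim, 3))
```

Note that the last step produces a dense rows-by-rows result. For 19000 rows that is 19000^2 doubles, roughly 2.9 GB, so on the full data it may be preferable to stop at the sparse dots object or to process the rows in blocks.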

======================================================

I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.

I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.

I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!

Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1), 
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)

cosine.sparse <- function(data) {

a <- Sys.time()

colnames(data) <- c('V1', 'V2', 'V3')
nvars <- length(unique(data[,2]))
nrows <- length(unique(data[,1]))

sim <- matrix(nrow=nrows, ncol=nrows)

for (i in 1:nrows) {

    data.i <- data[data$V1==i,]

    length.i.sq <- sum(data.i$V3^2)

    for (j in i:nrows) {

        data.j <- data[data$V1==j,]
        length.j.sq <- sum(data.j$V3^2)

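        common.vars <- intersect(data.i$V2, data.j$V2)

        # match() lines the values up by column id, so the result does not
        # depend on the order in which the entries appear in the data
        row1 <- data.i$V3[match(common.vars, data.i$V2)]
        row2 <- data.j$V3[match(common.vars, data.j$V2)]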

        cos.sim <- sum(row1*row2)/sqrt(length.i.sq*length.j.sq)

        sim[i,j] <- sim[j,i] <- cos.sim

    }

    if (i %% 500 == 0) {cat(i, "rows have been calculated.\n")}
}

b <- Sys.time()
time.elapsed <- b - a
print(time.elapsed)

return(sim)
}

cosine.sparse(Data)
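
For comparison, here is a rough sketch of a loop-free version that stays in long format using only base R: join the data to itself on the column ID, multiply the matched values, and sum per pair of rows. It assumes each (row, column) pair appears at most once; the names long.data, pairs, dots and norms are illustrative. The self-join grows with the square of the number of rows sharing a column, so I am not sure it scales to the full data.

```r
# Toy long-format data: row id (V1), column id (V2), value (V3)
long.data <- data.frame(
  V1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  V2 = c(1, 2, 3, 1, 2, 4, 1, 4, 5),
  V3 = c(1, 2, 2, 1, 1, 1, 1, 3, 1)
)

# Self-join on the column id: one output line per pair of rows
# that both have a non-zero value in the same column
pairs <- merge(long.data, long.data, by = "V2", suffixes = c(".i", ".j"))
pairs$prod <- pairs$V3.i * pairs$V3.j

# Dot product of every pair of rows that share at least one column
dots <- aggregate(prod ~ V1.i + V1.j, data = pairs, FUN = sum)

# Euclidean norm of each row, named by row id
norms <- sqrt(tapply(long.data$V3^2, long.data$V1, sum))
n.i <- as.numeric(norms[as.character(dots$V1.i)])
n.j <- as.numeric(norms[as.character(dots$V1.j)])

# Cosine similarity, still in long format; pairs with no shared column
# are absent from the result and have similarity zero
dots$cos <- dots$prod / (n.i * n.j)
print(dots)
```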

If I understand apply correctly, it's used to apply functions to margins of the data, right? I'm not sure how I can use that to specify pairwise operations between values of "V2" in my example. I could make it work if I transformed the data to wide format, but that defeats the purpose. Any thoughts/suggestions? — J.F., Apr 11 '17 at 15:01
Actually you may want to look here http://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix/24627329#24627329 or here http://stackoverflow.com/questions/2535234/find-cosine-similarity-between-two-arrays — B Williams, Apr 11 '17 at 16:49
The "aggregating sparse data" link is exactly my question! Buuut all the answers basically convert the data to wide format (the full n x p matrix). My matrix is too large for that, unfortunately. I might have to give up and try it in Python instead. Thanks anyway! — J.F., Apr 11 '17 at 18:43
