
I have this big binary data.table:

> str(mat)
Classes 'data.table' and 'data.frame':  262561 obs. of  10615 variables:
 $ 1001682: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1001990: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002541: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002790: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1003312: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1004403: num  0 0 0 0 0 0 0 0 0 0 ...

There are some 1s in it (it's not all zeros). I'm trying to convert it with `mat <- data.matrix(mat)`, but the R session always aborts. Is it a problem with my computer? Should I try a high-performance computer? Or is there some other way to do this? I need it as a data.matrix.

I'm using a MacBook Pro (early 2015) with a 2.7 GHz Intel Core i5 and 8 GB of DDR3.

  • "I need it in data.matrix." Really? It rather looks like you should make it a sparse matrix (see package Matrix). – Roland Aug 24 '17 at 10:30
  • 8 * 262561 * 10615 / 1024^3 ~ 21 GB. How can you have this in memory with only 8 GB? Does `data.table` already know it is sparse? Or is it some laziness of R? – F. Privé Aug 24 '17 at 10:48
  • @Roland This really looks like a sparse matrix, but that's because of the data I use. I'm trying to write general code, so `mat` could sometimes contain more 1s than 0s. – Martina Zapletalová Aug 24 '17 at 10:59
  • That wouldn't be a problem for a sparse matrix structure ... – Roland Aug 24 '17 at 11:00
  • @Roland OK, but how should I make a sparse matrix from a data.table? I need to have the values "1001682", "1001990" etc. in colnames. I looked [here](https://stackoverflow.com/questions/26207850/create-sparse-matrix-from-a-data-frame) for a hint, but it didn't help. I can store `colnames(mat)` and just use the values from `mat`, but I don't understand the help for `sparseMatrix` well. – Martina Zapletalová Aug 24 '17 at 11:23
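
For reference, the arithmetic in F. Privé's comment can be reproduced directly in R. This is only a back-of-the-envelope sketch: it assumes every cell is stored as an 8-byte double and ignores any per-object overhead, and the variable names are made up for illustration.

```r
# Dimensions taken from the str() output in the question
n_row <- 262561
n_col <- 10615

# The product is computed in double precision, so it does not overflow
n_cells <- n_row * n_col

# Assumption: 8 bytes per cell (numeric), no overhead counted
dense_gib <- 8 * n_cells / 1024^3
round(dense_gib, 1)

# The cell count is also larger than the biggest 32-bit integer
n_cells > .Machine$integer.max
```

So a dense numeric matrix of this shape would need well over twice the 8 GB of RAM on the machine described.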

2 Answers


Here is how you can convert the data.table to a sparse matrix:

library(data.table)
library(Matrix)
DT <- fread("A B C D E
            0 1 0 1 0
            1 0 0 0 0
            1 1 1 0 1")

ncol <- length(DT)
nrow <- nrow(DT)
dimnames <- names(DT)

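# melt() stacks the columns, so long-format row k is cell k in column-major order
# (melt() may warn that no id.vars were given; that is expected here)
DT <- melt(DT)
inds <- DT[, which(as.logical(value))]
# recover 1-based row and column indices from the column-major position
i <- (inds - 1) %% nrow + 1
j <- (inds - 1) %/% nrow + 1

DT <- sparseMatrix(i = i, j = j, x = TRUE, dims = c(nrow, ncol), dimnames = list(NULL, dimnames))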
#3 x 5 sparse Matrix of class "lgCMatrix"
#     A B C D E
#[1,] . | . | .
#[2,] | . . . .
#[3,] | | | . |

It is unclear what you want to do with the data, but a sparse matrix is the most memory-efficient data structure here. Of course, the functions you plan to use must be able to deal with such a structure.
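
To get a feeling for the memory difference, here is a rough sketch on simulated data. The dimensions and the roughly 1% fill rate are arbitrary assumptions, not the OP's real data, and the exact sizes depend on the R and Matrix versions, so no numbers are quoted.

```r
library(Matrix)

set.seed(1)
n_row <- 2000   # assumed toy dimensions
n_col <- 500
k <- 10000      # assumed number of 1s (about 1% of the cells)

# Draw k distinct cell positions and convert them to row/column indices
idx <- sample(n_row * n_col, k)
i <- (idx - 1) %% n_row + 1
j <- (idx - 1) %/% n_row + 1

sp <- sparseMatrix(i = i, j = j, x = 1, dims = c(n_row, n_col))
dense <- as.matrix(sp)   # fine at this size, not at the OP's size

# Compare the in-memory footprint of the two representations
print(object.size(dense), units = "Kb")
print(object.size(sp), units = "Kb")
```

The sparse object only stores the non-zero entries plus some index vectors, so its size grows with the number of 1s rather than with the number of cells.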

Edit:

OP wants to calculate the cosine similarity.

library(qlcMatrix)
cosSparse(DT)
#5 x 5 sparse Matrix of class "dsCMatrix"
#          A         B         C         D         E
#A 1.0000000 0.5000000 0.7071068 .         0.7071068
#B 0.5000000 1.0000000 0.7071068 0.7071068 0.7071068
#C 0.7071068 0.7071068 1.0000000 .         1.0000000
#D .         0.7071068 .         1.0000000 .        
#E 0.7071068 0.7071068 1.0000000 .         1.0000000
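
For anyone who cannot install qlcMatrix, the same numbers can be reproduced with Matrix alone. This is a sketch of the textbook column-wise cosine similarity, not necessarily how `cosSparse` is implemented, and it assumes no column is entirely zero (otherwise the scaling divides by zero). The toy matrix is rebuilt by hand so the snippet runs on its own.

```r
library(Matrix)

# Toy matrix from above, entered as (row, column) pairs of the 1s
X <- sparseMatrix(i = c(2, 3, 1, 3, 3, 1, 3),
                  j = c(1, 1, 2, 2, 3, 4, 5),
                  x = 1, dims = c(3, 5),
                  dimnames = list(NULL, c("A", "B", "C", "D", "E")))

# Euclidean norm of each column (assumes no all-zero column)
norms <- sqrt(colSums(X^2))

# Scale every column to unit length; the result stays sparse
Xn <- X %*% Diagonal(x = 1 / norms)

# Cosine similarity is then the cross-product of the scaled columns
sim <- crossprod(Xn)
dimnames(sim) <- list(colnames(X), colnames(X))

round(as.matrix(sim), 7)   # densified only because the toy result is 5 x 5
```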
Roland
  • So the output will really look like dots and lines? After I get my data into some kind of matrix, I need to compute cosine similarity. That's why I need it as a data.matrix (or something similar), so I'm not sure this sparse matrix is the right input for the `cosine` function. (Maybe I should have mentioned that before.) – Martina Zapletalová Aug 24 '17 at 16:42
  • Please don't confuse print output with the internal structure. Anyway, see https://www.rdocumentation.org/packages/qlcMatrix/versions/0.9.5/topics/cosSparse – Roland Aug 24 '17 at 17:55
  • I just had time to try it and I have one question: is there any way to transform this sparse similarity matrix back into an ordinary matrix that looks exactly like your print output, but with 0 instead of . ? – Martina Zapletalová Aug 30 '17 at 12:52
  • Simply use `as.matrix`, but that will increase memory demand, possibly by a lot. – Roland Aug 30 '17 at 13:03
  • I was just writing that here: when I try `as.matrix`, R quits with no warning or with "R session aborted". – Martina Zapletalová Aug 30 '17 at 13:05
  • I realize that I will probably use the sparse matrix and not try to transform it back. – Martina Zapletalová Aug 30 '17 at 13:28
  • Why do you care how zero is plotted anyway? – Roland Aug 30 '17 at 13:29
  • That isn't the problem. I was running a for loop and it didn't work, so I thought it was because I originally used an ordinary matrix in this loop. But then I realized that I was using a data frame, not a matrix, so I can't access the rows of the sparse matrix with [[..]]. – Martina Zapletalová Aug 30 '17 at 13:34
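
On the last two points, a sparse matrix is indexed like an ordinary matrix with `[i, j]`, not like a data frame with `[[`, and a small slice can be made dense without converting the whole object. A minimal sketch on the toy data follows; the object names are illustrative only.

```r
library(Matrix)

# Toy matrix from the answer, rebuilt by hand so the snippet is self-contained
m <- sparseMatrix(i = c(2, 3, 1, 3, 3, 1, 3),
                  j = c(1, 1, 2, 2, 3, 4, 5),
                  x = 1, dims = c(3, 5),
                  dimnames = list(NULL, c("A", "B", "C", "D", "E")))

# A single row comes back as a plain named numeric vector
row3 <- m[3, ]

# Keep the result sparse by asking for drop = FALSE
row3_sparse <- m[3, , drop = FALSE]

# Densify only a small block instead of the whole matrix
block <- as.matrix(m[2:3, c("A", "B")])
```

Inside a loop this means using `m[k, ]` where the data frame version used `[[`, and only calling `as.matrix` on pieces that are known to fit in memory.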

I am not sure if this is more efficient than Roland's method, but it produces the same matrix and does not require reshaping the data. It does require essentially the same `lapply` as in my previous answer to the OP, which she described as slow. It uses the data.table constructed in Roland's answer.

library(Matrix)

# get positions of non-zero values in data.table
myRows <- lapply(DT, function(x) which(x != 0))

# build sparse matrix
DT <- sparseMatrix(i = unlist(myRows), # row positions of non zero values
                   j = rep(seq_along(myRows), lengths(myRows)), # column positions
                   dims = c(nrow(DT), ncol(DT))) # dimension of matrix

This returns

DT
3 x 5 sparse Matrix of class "lgCMatrix"

[1,] . | . | .
[2,] | . . . .
[3,] | | | . |
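
Since the OP needs the IDs as column names, here is a variant of the same idea that also passes `dimnames`. It is a sketch on the toy data, using a plain data.frame named `df` (an assumption, chosen so the snippet does not depend on data.table); the same `lapply` works on a data.table.

```r
library(Matrix)

# Toy data from Roland's answer as a base data.frame
df <- data.frame(A = c(0, 1, 1), B = c(1, 0, 1), C = c(0, 0, 1),
                 D = c(1, 0, 0), E = c(0, 0, 1))

# Row positions of the non-zero values, one vector per column
nz <- lapply(df, function(x) which(x != 0))

m <- sparseMatrix(i = unlist(nz, use.names = FALSE),          # row positions
                  j = rep(seq_along(nz), lengths(nz)),        # column positions
                  dims = dim(df),
                  dimnames = list(NULL, names(df)))           # keep column names

colnames(m)
```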
lmo