I want to create a co-occurrence matrix based on the code recommended here (also reproduced below). It works fine for most of the data frames I work with. For larger data frames, however, I get one of the following error messages, either from data.table::melt ...

negative length vectors are not allowed

... or later on using base::crossprod

error in crossprod: attempt to make a table with >=2^31 elements

Both errors relate to the size of the data frame: the first is triggered by the number of rows to be melted, while in the second case the resulting matrix exceeds R's limit of 2^31 elements.
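As a back-of-the-envelope check (using the length(uniq) of 204788 that comes up in the comments below), the full result is far past both limits:

n_values <- 204788          # number of distinct values (length(uniq), see comments)
n_values^2                  # ~4.2e10 cells in the full co-occurrence matrix
n_values^2 > 2^31           # TRUE: exceeds the 2^31 element limit hit by table()
n_values^2 * 8 / 2^30       # ~312 GiB as a dense double matrix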

I'm aware of the solutions proposed for the first issue (data.table::melt) in [2], [3] and [4], as well as for the second issue (base::crossprod) in [5] and [6], and I've seen [7], but I'm not sure how to adapt them properly to my situation. I have also tried to split the data frame by ID into several smaller data frames, calculate the co-occurrence matrix for each and merge the results (a sketch of this attempt follows the original code below), but I've only produced additional error messages (e.g., cannot allocate vector of size 17.8 GB).

Reproducible Example

I have a data frame assembled via plyr::join that looks like this (but, of course, a lot larger):

df <- data.frame(ID = c(1, 2, 3, 200000), 
                  C1 = c("England", "England", "England", "China"),
                  C2 = c("England", "China", "China", "England"),
                  C5850 = c("England", "China", "China", "England"),
                  SC1 = c("FOO", "BAR", "EAT", "FOO"),
                  SC2 = c("MERCI", "EAT", "EAT", "EAT"),
                  SC5850 = c("FOO", "MERCI", "FOO", "FOO"))

ID      C1      C2      ... C5850    SC1 SC2   ... SC5850
1       England England     England  FOO MERCI     FOO
2       England China       China    BAR EAT       MERCI
3       England China       China    EAT EAT       EAT
200000  China   England     England  FOO EAT       FOO

Original Code

colnames(df) <- c("ID", paste0("SCCOUNTRY", 1:6))  # keep ID; rename the measure columns

library(data.table)

foo <- melt(setDT(df), id.vars = "ID",
            measure = patterns("^SCCOUNTRY"))[nchar(value) > 0 & complete.cases(value)]
foo2 <- unique(foo, by = c("ID", "value"))   # at most 1 per ID/value pair (condition 4)
mymat <- crossprod(table(foo2[, c(1, 3)]))   # co-occurrence counts
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))   # zero out singleton self-pairs
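For reference, the split-and-merge attempt mentioned above looked roughly like this (a sketch from memory, not the exact code: the four-way split is arbitrary, and fixing the value levels up front is my addition so the per-chunk matrices can be summed). It still fails on the real data because table() and the final crossprod() result stay dense:

lev <- sort(unique(as.character(unlist(as.data.frame(df)[-1]))))  # fix levels once across chunks

ids <- unique(df$ID)
id_chunks <- split(ids, cut(seq_along(ids), 4, labels = FALSE))

mymat2 <- NULL
for (chunk in id_chunks) {
  foo  <- melt(setDT(df)[ID %in% chunk], id.vars = "ID",
               measure = patterns("^SCCOUNTRY"))[nchar(value) > 0 & complete.cases(value)]
  foo2 <- unique(foo, by = c("ID", "value"))
  tab  <- table(foo2$ID, factor(foo2$value, levels = lev))   # still a dense table
  mymat2 <- if (is.null(mymat2)) crossprod(tab) else mymat2 + crossprod(tab)
}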

Conditions (for the calculation of the co-occurrence matrix)

  1. Single observations without any co-occurring observation in the same row/ID are not counted, i.e. a row containing only a single country once is counted as 0.
  2. A combination/co-occurrence is counted as 1.
  3. Being part of a combination also counts as a self-combination (USA-USA), i.e. a value of 1 is assigned.
  4. No value over 1 is assigned to a combination per row/ID.
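To make the conditions concrete, here is a tiny hand-worked case (my own toy example, not from the real data):

toy <- data.frame(ID = 1:2,
                  SCCOUNTRY1 = c("USA", "USA"),
                  SCCOUNTRY2 = c("",    "Canada"))

# Row 1 contains USA once and nothing else: a single observation, counted as 0 (condition 1).
# Row 2 contains USA and Canada: USA-Canada = 1 (condition 2), and both
# USA-USA = 1 and Canada-Canada = 1 (condition 3); nothing exceeds 1 per row (condition 4).
# Expected co-occurrence matrix:
#
#         Canada USA
# Canada       1   1
# USA          1   1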
  • try passing in `na.rm=TRUE, variable.factor=FALSE` in your `melt` – chinsoon12 Jan 23 '20 at 22:10
  • Flodel's answer from [here](https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r) might help. Depending on the sparsity of the terms it *may* ease the memory issues. – user20650 Jan 23 '20 at 22:41
  • ... taking a hint from there: `uniq = sort(as.character(unique(unlist(df[-1]))))`, but you may need to use `uniq = sort(as.character(unique(unlist(lapply(df[-1], unique)))))` due to vector length. Then `i = rep(seq_len(length(df$ID)), ncol(df[-1])); j = unlist(lapply(df[-1], match, table=uniq)); m = sparseMatrix(i = i, j = j, x = 1, dimnames = list(df$ID, uniq)); crossprod(m)` – user20650 Jan 23 '20 at 22:41
  • @chinsoon12 - Thanks for your suggestion! It produces the error message `Error: invalid subscript type 'list'` – Seb Jan 24 '20 at 12:18
  • @user20650 - Thank you! I assume that's the way to go. Due to vector length, I did indeed have to use `uniq=sort(as.character(unique(unlist(lapply(df[-1], unique)))))`. However, the calculation of `j = unlist(lapply(df[-1], match, table=uniq))` results in a crash of R and a reboot of the OS (in this case Win10 64bit, 16GB RAM). I'll try to split the dataframe before the calculations and see if I can make it work. – Seb Jan 24 '20 at 12:26
  • @Seb; what is the length of `uniq`? -- this will be the number of columns in the sparse matrix. But if I've calculated correctly your df is nearly 9GB (200000*5850*8/2^30), so you're not left with much room to do stuff. I suppose you could try to do this in chunks. – user20650 Jan 24 '20 at 12:52
  • @user20650 well, the length is 204788. I get the error message cannot allocate vector of size 1024.0 Mb if I separate it into four parts, which confuses me a bit because 1024 Mb shouldn't be a problem, right? Anyway, thank you for your help! I assume that more chunks could do the trick. I'll see if I can make it work, I guess. – Seb Jan 24 '20 at 16:35
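Putting user20650's sparse-matrix code and the chunking idea from the comments together, this is the direction I'm currently trying (a sketch, not verified on the full data: chunk_size is an arbitrary knob, each ID is assumed to occupy exactly one row, and the diagonal rule at the end just mirrors the one above):

library(Matrix)

uniq <- sort(as.character(unique(unlist(lapply(as.data.frame(df)[-1], unique)))))
uniq <- setdiff(uniq, "")                       # mirror the nchar(value) > 0 filter

chunk_size <- 25000                             # arbitrary; tune to available RAM
row_chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))

res <- NULL
for (rows in row_chunks) {
  sub <- as.data.frame(df)[rows, -1]            # plain data.frame without the ID column
  i <- rep(seq_len(nrow(sub)), times = ncol(sub))
  j <- unlist(lapply(sub, match, table = uniq))
  keep <- !is.na(j)                             # drops empty strings and non-matches
  m <- sparseMatrix(i = i[keep], j = j[keep], x = 1,
                    dims = c(nrow(sub), length(uniq)))
  m@x <- pmin(m@x, 1)                           # at most 1 per ID/value pair (condition 4)
  res <- if (is.null(res)) crossprod(m) else res + crossprod(m)
}

dimnames(res) <- list(uniq, uniq)
diag(res) <- ifelse(diag(res) <= 1, 0, diag(res))   # same diagonal rule as above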

0 Answers