I want to create a co-occurrence matrix based on the recommended code here (also see below). It works fine for most of the dataframes I work with. For larger dataframes, however, I get one of the following error messages, either when I use data.table::melt ...
negative length vectors are not allowed
... or later on when I use base::crossprod:
error in crossprod: attempt to make a table with >=2^31 elements
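Both are related to the size of the dataframe. In the first case, the problem is the number of rows: the melted (long) table would apparently have more rows than the 2^31 - 1 that fit into a 32-bit integer. In the latter case, the table passed to crossprod would have 2^31 or more elements, which exceeds the limit.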
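To make the two limits concrete, here is a quick back-of-the-envelope check. The row count and the number of distinct values are assumptions chosen for illustration, not the real data; only the number of measure columns follows from the column names in the example below.

```r
# Hypothetical dimensions, for illustration only
n_rows    <- 2e5        # assumed number of IDs (rows of the wide table)
n_measure <- 2 * 5850   # C1..C5850 plus SC1..SC5850

# Rows that melt() would have to create in the long table
long_rows <- n_rows * n_measure
exceeds_melt_limit <- long_rows > .Machine$integer.max

# Cells of a dense ID-by-value table, for an assumed number of distinct values
n_values <- 11000       # assumption
table_cells <- n_rows * n_values
exceeds_table_limit <- table_cells >= 2^31
```

With these assumed sizes both checks come out TRUE, which would be consistent with seeing one error at the melt step and the other at the table/crossprod step.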
I'm aware of the solutions for the first issue (data.table::melt) proposed by [2], [3] and [4], as well as for the second issue (base::crossprod) by [5] and [6], and I've seen [7], but I'm not sure how to adapt them properly to my situation. I have tried to split the dataframe by ID into several dataframes, merge them and calculate the co-occurrence matrix, but this only produced additional error messages (e.g., cannot allocate vector of size 17.8 GB).
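For reference, the splitting idea might be sketched roughly as follows: melt the wide table in chunks of rows and keep only the unique (ID, value) pairs of each chunk, so that the full long table is never materialised at once. The helper name melt_in_chunks, the chunk size and the toy data are all hypothetical, and the sketch assumes every non-ID column is a character column.

```r
library(data.table)

# Hypothetical helper: melt a wide data.table chunk by chunk and keep
# only the unique (ID, value) pairs found in each chunk
melt_in_chunks <- function(dt, chunk_size = 2L) {
  # Assign consecutive rows to chunks of at most `chunk_size` rows
  chunk_id <- (seq_len(nrow(dt)) - 1L) %/% chunk_size
  pieces <- lapply(split(seq_len(nrow(dt)), chunk_id), function(idx) {
    long <- melt(dt[idx], id.vars = "ID", variable.factor = FALSE)
    # Drop missing and empty values, then de-duplicate within the chunk
    unique(long[!is.na(value) & nchar(value) > 0, .(ID, value)])
  })
  rbindlist(pieces)
}

# Toy data, illustrative only
wide <- data.table(ID  = 1:4,
                   C1  = c("England", "England", "England", "China"),
                   C2  = c("England", "China", "China", "England"),
                   SC1 = c("FOO", "BAR", "EAT", "FOO"))

pairs <- melt_in_chunks(wide, chunk_size = 2L)
```

Because each ID falls into exactly one chunk, the combined result contains no duplicate pairs. Whether this is enough to stay within memory for the real data is untested.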
Reproducible Example
I have an assembled dataframe created by plyr::join that looks like this (but, of course, a lot larger):
df <- data.frame(ID = c(1, 2, 3, 20000),
                 C1 = c("England", "England", "England", "China"),
                 C2 = c("England", "China", "China", "England"),
                 C5850 = c("England", "China", "China", "England"),
                 SC1 = c("FOO", "BAR", "EAT", "FOO"),
                 SC2 = c("MERCI", "EAT", "EAT", "EAT"),
                 SC5850 = c("FOO", "MERCI", "FOO", "FOO"))
ID     C1       C2       ...  C5850    SC1  SC2    ...  SC5850
1      England  England       England  FOO  MERCI       FOO
2      England  China         China    BAR  EAT         MERCI
3      England  China         China    EAT  EAT         FOO
20000  China    England       England  FOO  EAT         FOO
Original Code
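library(data.table)
# Rename the six measure columns but keep ID, which melt() uses as id.vars
colnames(df)[-1] <- paste0("SCCOUNTRY", 2:7)
# Reshape to long format and drop empty or missing values
foo <- melt(setDT(df), id.vars = "ID", measure = patterns("^SCCOUNTRY"))[nchar(value) > 0 & complete.cases(value)]
# Keep each value at most once per ID
foo2 <- unique(foo, by = c("ID", "value"))
# Co-occurrence counts: crossproduct of the ID-by-value incidence table
mymat <- crossprod(table(foo2[, c(1, 3)]))
# Zero out diagonal entries of values that occur for at most one ID
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))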
Conditions (for the calculation of the co-occurrence matrix)
- Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
- A combination/co-occurrence should be counted as 1.
- Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
- There is no value over 1 assigned to a combination by row/ID.
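One possible reading of these conditions, sketched on toy data with a sparse incidence matrix from the Matrix package instead of base::table. This swaps in a different technique from the original code and rests on an assumption: a row "counts" only if it contains at least two distinct values, and each such row then contributes 1 to every pair of its values, including the self-pairs. The data and the two column names are illustrative only.

```r
library(data.table)
library(Matrix)

# Toy data, illustrative only
df <- data.frame(ID = c(1, 2, 3, 4),
                 C1 = c("England", "England", "England", "China"),
                 C2 = c("England", "China", "", "England"),
                 stringsAsFactors = FALSE)

# Long format, one row per distinct (ID, value) pair
long <- melt(setDT(df), id.vars = "ID", measure.vars = c("C1", "C2"))
long <- unique(long[!is.na(value) & nchar(value) > 0], by = c("ID", "value"))

# Sparse 0/1 incidence matrix: rows are IDs, columns are values
ids  <- factor(long$ID)
vals <- factor(long$value)
inc  <- sparseMatrix(i = as.integer(ids), j = as.integer(vals), x = 1,
                     dims = c(nlevels(ids), nlevels(vals)),
                     dimnames = list(levels(ids), levels(vals)))

# Assumption: rows with fewer than two distinct values are not considered
inc <- inc[rowSums(inc) >= 2, , drop = FALSE]

# Each remaining row adds exactly 1 to every pair of its values,
# self-pairs on the diagonal included
co <- crossprod(inc)
```

Here only IDs 2 and 4 contain a combination, so every entry of co is 2 on this toy input. Keeping the incidence matrix sparse may avoid allocating the dense ID-by-value table, though it has not been checked against data of the real size.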