Let's say I have the following dataset:
set.seed(42)
test <- data.frame(event_id = stringi::stri_rand_strings(1000, 2, '[A-Z]'), person_id = floor(runif(1000, min=0, max=500)))
> head(test)
  event_id person_id
1       EP       438
2       IX       227
3       AV       212
4       GX       469
5       QF       193
6       MM       222
I want to transform this into an adjacency dataset where both the rows and the columns are person_id values, and each cell holds the number of distinct event_ids that the two individuals both appeared in.
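To make the target concrete, here is a minimal sketch on a made-up toy dataset. The names `toy`, `incidence`, and `expected` are hypothetical and used only for illustration. It relies on base R's `table` and `crossprod`, which is fine at this tiny size:

```r
# Toy data (hypothetical): persons 1 and 2 both attend events A and B,
# person 3 attends only event B
toy <- data.frame(event_id  = c("A", "A", "B", "B", "B"),
                  person_id = c(1, 2, 1, 2, 3))

# Event-by-person incidence table: rows are events, columns are persons
incidence <- table(toy$event_id, toy$person_id)

# crossprod(X) computes t(X) %*% X, so cell [p, q] counts the events
# that persons p and q share; the diagonal is each person's event count
expected <- crossprod(incidence)
expected
```

In this toy case the cell for persons 1 and 2 should be 2 (events A and B), and the cell for persons 1 and 3 should be 1 (event B only).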
I tried doing something like this:
adjacency_df <- test %>%
  select('event_id', 'person_id') %>%
  melt('event_id', value.name = 'invitee_id') %>%
  dcast(invitee_id ~ invitee_id, fun.aggregate = n_distinct, value.var = 'event_id')
However, when I convert the result to an adjacency matrix and check whether any off-diagonal entries are non-zero, like so:
#convert to a matrix, and rename rownames
adjacency_matrix <- as.matrix(sapply(adjacency_df[, -1], as.numeric))
rownames(adjacency_matrix) <- colnames(adjacency_matrix)
#identify if only the diagonal of the matrix is non-zero
all(adjacency_matrix[lower.tri(adjacency_matrix)] == 0, adjacency_matrix[upper.tri(adjacency_matrix)] == 0)
I find that all off-diagonal values are zero:
> all(adjacency_matrix[lower.tri(adjacency_matrix)] == 0, adjacency_matrix[upper.tri(adjacency_matrix)] == 0)
[1] TRUE
What is the most efficient way to do this (note the dataset contains 2 million observations)?
I have tried the technique suggested in the comments section and get the following error on my actual dataset:
adjacency_df <- crossprod(table(test))
Error in table(adjacency_df) :
attempt to make a table with >= 2^31 elements
So I need a better approach.
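One possible direction, sketched here under assumptions and not verified on the full 2 million row data, is to keep the incidence matrix sparse with the Matrix package so that only observed (event, person) pairs are stored, instead of the dense table that triggers the 2^31 limit. The helper name `sparse_adjacency` and the toy data are hypothetical:

```r
library(Matrix)

# Hypothetical helper: sparse person-by-person co-occurrence counts.
# Assumes df has columns event_id and person_id.
sparse_adjacency <- function(df) {
  # Drop repeated (event, person) pairs so each attendance counts once
  df <- unique(df[, c("event_id", "person_id")])
  events  <- factor(df$event_id)
  persons <- factor(df$person_id)

  # Sparse event-by-person incidence matrix with a 1 for each observed pair
  incidence <- sparseMatrix(
    i = as.integer(events),
    j = as.integer(persons),
    x = rep(1, nrow(df)),
    dims = c(nlevels(events), nlevels(persons)),
    dimnames = list(levels(events), levels(persons))
  )

  # t(incidence) %*% incidence, with the result kept in sparse form
  crossprod(incidence)
}

# Small made-up example: persons 1 and 2 share events A and B
toy <- data.frame(event_id  = c("A", "A", "B", "B", "B"),
                  person_id = c(1, 2, 1, 2, 3))
adj <- sparse_adjacency(toy)
adj
```

Whether the result fits in memory on the real data depends on how many pairs of persons actually share an event, so this is a starting point and not a guaranteed fix.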