
Let's say I have the following dataset:

set.seed(42)
test <- data.frame(event_id = stringi::stri_rand_strings(1000, 2, '[A-Z]'), person_id = floor(runif(1000, min=0, max=500)))

> head(test)
  event_id person_id
1       EP       438
2       IX       227
3       AV       212
4       GX       469
5       QF       193
6       MM       222

I want to transform this into an adjacency matrix whose rows and columns are the person_id values and whose entries are the number of event_ids each pair of individuals appeared in together.

I tried doing something like this:

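library(dplyr)     # %>%, select(), n_distinct()
library(reshape2)  # melt(), dcast() (assumed source of these two functions)

adjacency_df <- test %>%
  select('event_id', 'person_id') %>%
  melt('event_id', value.name = 'invitee_id') %>%
  dcast(invitee_id~invitee_id, fun.aggregate = n_distinct, value.var = 'event_id')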

But when I convert this to an adjacency matrix and check whether any of the off-diagonal values are non-zero, like so:

#convert to a matrix, and rename rownames
adjacency_matrix <- as.matrix(sapply(adjacency_df[, -1], as.numeric))  
rownames(adjacency_matrix) <- colnames(adjacency_matrix)

#identify if only the diagonal of the matrix is non-zero
all(adjacency_matrix[lower.tri(adjacency_matrix)] == 0, adjacency_matrix[upper.tri(adjacency_matrix)] == 0)

I find that all off-diagonal values are zero:

> all(adjacency_matrix[lower.tri(adjacency_matrix)] == 0, adjacency_matrix[upper.tri(adjacency_matrix)] == 0)
[1] TRUE

What is the most efficient way to do this (note the dataset contains 2 million observations)?

I have tried the technique suggested in the comments section and get the following error on my actual dataset:

adjacency_df <- crossprod(table(test))
Error in table(adjacency_df) : 
  attempt to make a table with >= 2^31 elements

So I need a better approach.

Parseltongue
  • Have a look at this question: https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix. Answer by A5C1D2H2I1M1N2O1R2T1 mentions `crossprod(table(df))` – Lamia Aug 09 '18 at 18:16
  • Please see edit. I have tried the crossprod approach, but it was ineffective. – Parseltongue Aug 09 '18 at 20:34
  • Does the `igraph` library do what you need? I.e., something like `library(igraph) ; g <- graph_from_edgelist(as.matrix(test), directed = F) ; V(g)$type <- V(g)$name %in% test$event_id ; as_adj(bipartite_projection(g, which = "false"))` – Jake Fisher Aug 09 '18 at 20:52

2 Answers


Since matrix size seems to be the issue, you can do this using the Matrix version of crossprod, as follows:

library(Matrix)

mat <- with(
  test,
  sparseMatrix(
    i = as.numeric(factor(event_id)),
    j = as.numeric(factor(person_id)),
    dimnames = list(levels(factor(event_id)), levels(factor(person_id)))
  )
)

crossprod(mat)

The Matrix package stores only the non-zero entries of a sparse matrix, so it should be able to handle far more cells than a dense table.
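
As a variation on the code above, here is a minimal sketch on toy data (the `toy` data frame and its values are made up for illustration, not taken from the question). Without `x`, `sparseMatrix()` builds a pattern (TRUE/FALSE) matrix; supplying `x = 1` keeps the incidence matrix numeric, so the cross-product is explicitly a matrix of counts. Because `sparseMatrix()` sums the `x` values of repeated (i, j) pairs, the sketch assumes you want each person counted once per event and drops duplicate rows first.

```r
library(Matrix)

# Toy data (hypothetical values): three events, three people
toy <- data.frame(
  event_id  = c("A", "A", "A", "B", "B", "C"),
  person_id = c(1, 2, 3, 2, 3, 1)
)

# Drop repeated (event, person) rows so each incidence cell is 0 or 1
toy <- unique(toy)

ev <- factor(toy$event_id)
pe <- factor(toy$person_id)

# Sparse event x person incidence matrix with numeric entries
inc <- sparseMatrix(
  i = as.integer(ev),
  j = as.integer(pe),
  x = 1,
  dims = c(nlevels(ev), nlevels(pe)),
  dimnames = list(levels(ev), levels(pe))
)

# person x person matrix: entry [a, b] counts the events shared by a and b
adj <- crossprod(inc)

# The diagonal holds the number of events each person attended
events_per_person <- diag(adj)
```

In this toy data, persons 2 and 3 share events A and B, so their off-diagonal entry is 2. If you only care about co-occurrence between different people, ignore or drop the diagonal.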

Jake Fisher
  • Thank you so much. I've been working on this problem for one week -- both this and the comment you submitted work. The igraph solution is nice because I can conveniently generate an edgelist from it. Is there an easy way to produce edgelists from sparse matrices? – Parseltongue Aug 09 '18 at 22:02
  • Glad this was helpful. It looks like `summary(mat)` will do it: https://stackoverflow.com/questions/15849641/how-to-convert-a-sparse-matrix-into-a-matrix-of-index-and-value-of-non-zero-elem – Jake Fisher Aug 09 '18 at 22:14
  • Do you happen to know how to turn the resulting adjacency matrix into a graph object that the igraph library can read? I've been trying for an hour, and can't figure out how to get it to read in the adjacency matrix produced this way. @jake fisher – Parseltongue Aug 20 '18 at 01:33
  • @Parseltongue To do that, I would just stick with the approach I mentioned in my comment on your question. In the code there, `g` was the igraph graph object. If you want the bipartite projection, you'd do something like `bp <- bipartite_projection(g, which = "false")`. – Jake Fisher Aug 20 '18 at 19:24
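
The `summary()` suggestion from the comments can be sketched roughly as follows. The tiny incidence matrix and the `p1`..`p3` labels are hypothetical; the sketch assumes the co-occurrence matrix comes from `crossprod()`, which returns a symmetric sparse matrix, so `summary()` lists each pair only once.

```r
library(Matrix)

# Hypothetical event x person incidence matrix (3 events, 3 people)
inc <- sparseMatrix(
  i = c(1, 1, 1, 2, 2, 3),
  j = c(1, 2, 3, 2, 3, 1),
  x = 1,
  dimnames = list(c("A", "B", "C"), c("p1", "p2", "p3"))
)

adj <- crossprod(inc)   # symmetric person x person co-occurrence counts
nm  <- colnames(adj)

# summary() returns the non-zero entries in triplet form (columns i, j, x);
# for a symmetric matrix only one triangle is stored, so pairs are not repeated
s <- summary(adj)

edges <- data.frame(
  from   = nm[s$i],
  to     = nm[s$j],
  weight = s$x
)

# Drop the diagonal (a person paired with themselves)
edges <- edges[edges$from != edges$to, ]
```

The resulting `edges` data frame has one row per pair of people who share at least one event, with the number of shared events in `weight`.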

I'm not sure whether this will solve your error with crossprod, but maybe try it like this. Data as above:

library(dplyr)

set.seed(42)
test <-
  data.frame(
    event_id = stringi::stri_rand_strings(1000, 2, '[A-Z]'),
    person_id = floor(runif(1000, min = 0, max = 500))
  )

Group by event_id and make a table from that:

out <- test %>%
  group_by(event_id) %>%
  table() 

Use that grouped output as input for crossprod:

x <- crossprod(out)

Have a look at a small portion of that large matrix:

> x[1:20, 1:20]
         person_id
person_id 0 2 3 4 5 6 9 10 11 12 13 14 15 16 17 18 19 20 21 23
       0  1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0
       2  0 5 0 0 0 0 0  0  0  0  0  0  0  1  0  0  0  0  0  0
       3  0 0 4 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0
       4  0 0 0 3 0 0 0  0  0  0  1  0  0  0  0  0  0  0  0  0
       5  0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0
       6  0 0 0 0 0 1 0  0  0  0  0  0  0  0  0  0  0  0  0  0
       9  0 0 0 0 0 0 3  0  0  0  0  0  0  0  0  0  0  0  0  0
       10 0 0 0 0 0 0 0  4  0  0  0  0  0  0  0  0  0  0  0  0
       11 0 0 0 0 0 0 0  0  1  0  0  0  0  0  0  0  0  0  0  0
       12 0 0 0 0 0 0 0  0  0  2  0  0  0  0  0  0  0  0  0  0
       13 0 0 0 1 0 0 0  0  0  0  2  0  0  0  0  0  0  0  0  0
       14 0 0 0 0 0 0 0  0  0  0  0  3  0  0  0  0  0  0  0  0
       15 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0  0
       16 0 1 0 0 0 0 0  0  0  0  0  0  0  3  0  0  0  0  0  0
       17 0 0 0 0 0 0 0  0  0  0  0  0  0  0  1  0  0  0  0  0
       18 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  5  0  0  0  0
       19 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  3  0  0  0
       20 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  3  0  0
       21 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  2  0
       23 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  3

Is that close to the output you're expecting? It's hard to tell from the large matrix whether it's working, so have a look at this smaller example dataset:

{
  set.seed(42)
  test <-
    data.frame(
      event_id = sample(c("AB", "LM", "YZ"), size = 10, replace = TRUE),
      person_id = 1:10
    )
  out <- test %>%
    group_by(event_id) %>%
    table() 
  x <- crossprod(out)
  print(out)
  x
}

        person_id
event_id 1 2 3 4 5 6 7 8 9 10
      AB 0 0 1 0 0 0 0 1 0  0
      LM 0 0 0 0 1 1 0 0 1  0
      YZ 1 1 0 1 0 0 1 0 0  1
         person_id
person_id 1 2 3 4 5 6 7 8 9 10
       1  1 1 0 1 0 0 1 0 0  1
       2  1 1 0 1 0 0 1 0 0  1
       3  0 0 1 0 0 0 0 1 0  0
       4  1 1 0 1 0 0 1 0 0  1
       5  0 0 0 0 1 1 0 0 1  0
       6  0 0 0 0 1 1 0 0 1  0
       7  1 1 0 1 0 0 1 0 0  1
       8  0 0 1 0 0 0 0 1 0  0
       9  0 0 0 0 1 1 0 0 1  0
       10 1 1 0 1 0 0 1 0 0  1
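
For larger data, where `table()` runs into the 2^31-element limit, the same small example can be used to check that a sparse incidence matrix gives the same counts as the dense `crossprod(table(...))`. This is only a sketch: the names `small`, `dense` and `sparse` are made up here, and it uses `Matrix::sparseMatrix()` rather than the `table()` approach above.

```r
library(Matrix)

set.seed(42)
small <- data.frame(
  event_id  = sample(c("AB", "LM", "YZ"), size = 10, replace = TRUE),
  person_id = 1:10
)

# Dense version, as in the example above
dense <- crossprod(table(small))

# Sparse version: same event x person incidence, stored as non-zero cells only
ev <- factor(small$event_id)
pe <- factor(small$person_id)
inc <- sparseMatrix(
  i = as.integer(ev),
  j = as.integer(pe),
  x = 1,
  dims = c(nlevels(ev), nlevels(pe))
)
sparse <- crossprod(inc)

# TRUE when both approaches agree cell by cell
same <- all(as.vector(as.matrix(sparse)) == as.vector(dense))
same
```
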
Luke C
  • Thanks for the idea. Unfortunately, I get the same memory error: `> out <- calendar_subset %>% + select('meeting_id', 'invitee_id') %>% + group_by('meeting_id') %>% + table() Error in table(.) : attempt to make a table with >= 2^31 elements` – Parseltongue Aug 09 '18 at 22:01
  • @Parseltongue - Fair enough, not surprising really! Hopefully Jake's answer sorts you out completely. – Luke C Aug 09 '18 at 22:03
  • Really appreciate you taking the time, though! – Parseltongue Aug 09 '18 at 22:04