Co-occurrences from a large dataframe

Question

I have a dataframe recording which cities each user has visited:

df.visited <- data.frame(user  = c("john","john", 
                                   "claire", "claire", 
                                    "doe","doe"), 
                        city = c('Antananarivo', 'Barcelona', 
                                 'Caen', 'Dijon', 
                                 'Antananarivo', 'Caen'))

I want to create a graph of co-visits. For that, I need either an adjacency matrix (users x users) or an edge list (usera, userb, #co-visits).

I can do this for small datasets:

by_user_city <- table(df.visited)    

#        city
#user     Antananarivo Barcelona Caen Dijon
#claire            0         0    1     1
#doe               1         0    1     0
#john              1         1    0     0

adjacency <- by_user_city %*% t(by_user_city)

#     user 
#user     claire doe john
#claire      2   1    0
#doe         1   2    1
#john        0   1    2

library(reshape2)  # assumed source of melt()
edges <- melt(adjacency)

#    user   user value
#1 claire claire     2
#2    doe claire     1
#3   john claire     0
#4 claire    doe     1
#5    doe    doe     2
#6   john    doe     1
#7 claire   john     0
#8    doe   john     1
#9   john   john     2

For a large dataset, a log of 1.5M visits by more than 300,000 users, the table command complains:

Error in table(df.visited) : 
  attempt to make a table with >= 2^31 elements

So, how can I get the co-visit edges without running out of memory?

Perhaps, give a try to a sparse alternative -- `crossprod(sparseMatrix(i = as.integer(df.visited$city), j = as.integer(df.visited$user), x = 1L, dimnames = rev(sapply(df.visited, levels))))` — alexis_laz, Sep 12 '16 at 08:30
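
The sparse approach suggested in the comment above can be sketched as follows. This is a minimal illustration on the toy data, assuming the Matrix package is installed; the names `incidence`, `trip` and `edges` are made up for the example. The columns are converted with `factor()` explicitly because `data.frame()` no longer turns strings into factors by default in R >= 4.0.

```r
library(Matrix)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

# Integer codes for users and cities
user <- factor(df.visited$user)
city <- factor(df.visited$city)

# Sparse city x user incidence matrix: only observed visits are stored.
# Repeated (user, city) rows are summed, as table() would count them;
# apply unique() to the data first if each pair should count once.
incidence <- sparseMatrix(
  i = as.integer(city),
  j = as.integer(user),
  x = 1,
  dims = c(nlevels(city), nlevels(user)),
  dimnames = list(levels(city), levels(user))
)

# users x users co-visit counts, still sparse
adjacency <- crossprod(incidence)

# Edge list: strict upper triangle, so each pair appears once and
# the diagonal (a user's own visit count) is dropped
trip <- summary(triu(adjacency, k = 1))
edges <- data.frame(
  usera    = levels(user)[trip$i],
  userb    = levels(user)[trip$j],
  covisits = trip$x
)
```

Pairs of users with no city in common are never stored, which is what keeps memory use proportional to the number of actual co-visits rather than to users x users.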

score 3 · Answer 1 · edited May 23 '17 at 10:27

Given the size of your data, I suggest using the Java-based graph database Neo4j. Former Neo4j employee Nicole White wrote an R package for it, RNeo4j. I used it in 2014 to set up a lot of real-time analytics on a very large company social network, and it worked quite well.

You might also be able to make it work with some other graph database, but this is the one I know of and I think it's probably the most popular.

Here are the steps, as I see them:

1. Download Neo4j
2. install.packages("RNeo4j")
3. Connect: graph = startGraph("http://localhost:7474/db/data/")
4. Use the transactional endpoint to load the data
5. Query the results using Cypher

If you want more clarity on #4 and #5, there's an old post where someone asked how to scale up loading data into Neo4j with R, and White answered with examples of how to use the transactional endpoint and query the results. Of course, you could also load the data outside of R if you wanted to.
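
Steps 4 and 5 might look roughly like the sketch below. It is not taken from the post mentioned above: the node labels `User` and `City`, the relationship type `VISITED`, the helper names `load_visits` and `covisit_edges`, and the batch size are all assumptions made for illustration, and the `{user}` parameter syntax is the one used by the Neo4j versions RNeo4j targeted. The functions only do something when called against a running Neo4j server.

```r
# Sketch only: calling these requires the RNeo4j package and a Neo4j server.

# Step 4: load visits through the transactional endpoint, in batches
load_visits <- function(graph, visits, batch_size = 1000) {
  query <- "
  MERGE (u:User {name: {user}})
  MERGE (c:City {name: {city}})
  MERGE (u)-[:VISITED]->(c)
  "
  tx <- RNeo4j::newTransaction(graph)
  for (k in seq_len(nrow(visits))) {
    RNeo4j::appendCypher(tx, query,
                         user = as.character(visits$user[k]),
                         city = as.character(visits$city[k]))
    # Commit periodically so a single transaction does not grow without bound
    if (k %% batch_size == 0) {
      RNeo4j::commit(tx)
      tx <- RNeo4j::newTransaction(graph)
    }
  }
  RNeo4j::commit(tx)
  invisible(NULL)
}

# Step 5: ask the database for the co-visit edge list
covisit_edges <- function(graph) {
  query <- "
  MATCH (a:User)-[:VISITED]->(:City)<-[:VISITED]-(b:User)
  WHERE a.name < b.name
  RETURN a.name AS usera, b.name AS userb, count(*) AS covisits
  "
  RNeo4j::cypher(graph, query)
}
```

With the connection from step 3, usage would be `load_visits(graph, df.visited)` followed by `edges <- covisit_edges(graph)`, which should return a data frame of (usera, userb, covisits) rows.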

This also solves many future problems you may have, such as how to visualize the social graph, run all sorts of different queries on your network/forum, deal with increasing size, etc. You shouldn't run into memory problems this way, as it's really well-designed for scale.

You can use graphing packages like igraph and ggnet with it, keeping the memory-intensive parts in the graph database:

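library(RNeo4j)  # provides cypher(); `graph` is the connection from startGraph()
library(igraph)

# Fetch every relationship as a two-column edge list
query = "
MATCH (n)-->(m)
RETURN n.name, m.name
"

edgelist = cypher(graph, query)
ig = graph_from_data_frame(edgelist, directed = FALSE)

betweenness(ig)

plot(ig)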

Thanks Hack-R. Yet I would like to have the graph in R (igraph) since I want to apply some algos such as community detection. — alberto, Sep 11 '16 at 21:54
@alberto There are instructions on how to use `igraph` with `Rneo4J` on the package page. It's like this `library(igraph); query = " MATCH (n)-->(m) RETURN n.name, m.name "; edgelist = cypher(graph, query); ig = graph.data.frame(edgelist, directed=F); betweenness(ig); plot(ig)` — Hack-R, Sep 11 '16 at 21:57

score 2 · Answer 2 · answered Sep 11 '16 at 21:37

Try this to avoid the table function:

library(tidyr)
df.visited$val<-1
spread(df.visited,city,val,fill=0)

Shenglin Chen

Thanks Shenglin. Unfortunately, I get `Error: cannot allocate vector of size 288.3 Gb`. I think we should avoid regular matrices. – alberto Sep 11 '16 at 21:47

score 2 · Answer 3 · answered Sep 11 '16 at 22:11

Speaking of igraph - maybe try:

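library(igraph)
g <- graph_from_data_frame(df.visited)
# Label the two vertex sets (users vs. cities) of the bipartite graph
V(g)$type <- bipartite_mapping(g)$type
# proj1 links users who share a city; weight counts the shared cities
g2 <- bipartite_projection(g)$proj1
# Base head() avoids needing %>%; igraph:: avoids clashes with dplyr
head(igraph::as_data_frame(g2, "edges"))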
#     from  to weight
# 1   john doe      1
# 2 claire doe      1

g2 is the graph that you are probably (?) looking for.
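
When memory is tight, it may help to compute only the user projection instead of both. The sketch below is one way to do that with igraph's `which` argument, shown on the toy data. It assumes that no user name is identical to a city name, and it sets the vertex types directly from the `city` column rather than inferring them. The community detection step at the end is only an illustration of working with the projected graph; `g_users` and `communities` are made-up names.

```r
library(igraph)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen"),
  stringsAsFactors = FALSE
)

# Undirected bipartite graph: one vertex per user and per city
g <- graph_from_data_frame(df.visited, directed = FALSE)

# Cities get type TRUE, users get type FALSE
# (assumes user names and city names never collide)
V(g)$type <- V(g)$name %in% df.visited$city

# Project onto the FALSE-type vertices only, i.e. the users;
# edge weights count the number of shared cities
g_users <- bipartite_projection(g, which = "false")

# Edge list with columns from, to, weight
edges <- igraph::as_data_frame(g_users, what = "edges")

# Example downstream use: weighted community detection
communities <- cluster_louvain(g_users, weights = E(g_users)$weight)
membership(communities)
```

Skipping the city projection avoids building a second graph that is not needed for the co-visit analysis.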

lukeA

Almost there: `At vector.pmt:439 : cannot reserve space for vector, Out of memory` but it looks like I might be able to do it freeing some memory or running it in a small server. I'll try it tomorrow, but I think I'll send you some beers. – alberto Sep 11 '16 at 22:25
btw the last line says `as.data.frame(g2, "edges") %>% head Error in as.data.frame.default(g2, "edges") : cannot coerce class ""igraph"" to a data.frame` Which igraph version are you using? – alberto Sep 11 '16 at 22:26
I'm using `packageVersion("igraph")` `‘1.0.1’`. Beers are always welcome :-) – lukeA Sep 11 '16 at 22:49
