I have a data frame recording which cities each user has visited:
df.visited <- data.frame(user = c("john","john",
"claire", "claire",
"doe","doe"),
city = c('Antananarivo', 'Barcelona',
'Caen', 'Dijon',
'Antananarivo', 'Caen'))
I want to create a graph of co-visits, where two users are linked by the number of cities they have both visited. For that, I need either an adjacency matrix (users x users) or an edge list (usera, userb, #co-visits).
I can do this for small datasets:
by_user_city <- table(df.visited)
# city
#user Antananarivo Barcelona Caen Dijon
#claire 0 0 1 1
#doe 1 0 1 0
#john 1 1 0 0
adjacency <- by_user_city %*% t(by_user_city)
# user
#user claire doe john
#claire 2 1 0
#doe 1 2 1
#john 0 1 2
library(reshape2) # assumed source of melt(); the older reshape package also provides it
edges <- melt(adjacency)
# user user value
#1 claire claire 2
#2 doe claire 1
#3 john claire 0
#4 claire doe 1
#5 doe doe 2
#6 john doe 1
#7 claire john 0
#8 doe john 1
#9 john john 2
For a large dataset (a log of 1.5M visits by more than 300,000 users), the table command fails:
Error in table(df.visited) :
attempt to make a table with >= 2^31 elements
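The error is consistent with the size of the dense users x cities table that table() tries to allocate. As a rough check, the number of distinct cities is not stated here, so the figure below is only the threshold at which exactly 300,000 users would hit the limit:

```r
# Rough check of the dense table size (n_users taken from the text;
# the number of distinct cities is unknown, so we solve for the threshold)
n_users <- 300000
limit   <- 2^31                      # cell limit named in the error message

# Smallest number of distinct cities for which users x cities >= 2^31
min_cities <- ceiling(limit / n_users)
min_cities
# 7159
```

With more than 300,000 users the threshold is lower still, so a few thousand distinct cities are enough to trigger the error.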
So, how can I get the co-visit edges without running out of memory?
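One direction that might avoid the dense table is a sparse incidence matrix. The following is only a sketch, assuming the Matrix package is available; the column names usera, userb and covisits and the variable names are illustrative, and it is shown on the toy data rather than tested on the full log:

```r
library(Matrix)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen"))

# Encode users and cities as factors so their integer codes can index the matrix
users  <- factor(df.visited$user)
cities <- factor(df.visited$city)

# Sparse users x cities matrix; only visited pairs are stored, and repeated
# (user, city) rows are summed, which mirrors the counts from table()
incidence <- sparseMatrix(i = as.integer(users),
                          j = as.integer(cities),
                          x = 1,
                          dims = c(nlevels(users), nlevels(cities)),
                          dimnames = list(levels(users), levels(cities)))

# Sparse users x users product, the same quantity as
# by_user_city %*% t(by_user_city) in the small example
adjacency <- tcrossprod(incidence)

# Edge list from the nonzero entries; keeping i < j drops the diagonal
# and lists each pair of users once (zero co-visits are never stored)
nz <- summary(adjacency)
nz <- nz[nz$i < nz$j, ]
edges <- data.frame(usera    = rownames(adjacency)[nz$i],
                    userb    = rownames(adjacency)[nz$j],
                    covisits = nz$x)
```

Unlike the melt() output above, this edge list omits self-pairs and pairs with zero co-visits. The product can still become large if a single city is shared by very many users, since every pair of its visitors yields a nonzero entry.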