
My data.frame is below:

group_id  user_id
1000       26
1236       29
1236       46
3767       26
3767       46
5614       29
5614       45
5614       46

I need output as below:

User-1   User-2  #of-common-groups
26       26       2
26       46       1
29       29       2
29       45       1
29       46       2
45       29       1
45       45       1
45       46       1
46       26       1
46       29       2    
46       45       1
46       46       3

Is there a quick way to achieve this? I actually have 137 distinct groups and about 81000 users.

User 26 belongs to 2 groups, and he also shares group 3767 with user 46. Hence:

26 26 2
26 46 1
46 26 1
46 46 3  (user 46 belongs to 3 groups) etc
Viswa
  • How many rows does your real data have - i suspect my answer wont be very efficient if it is large - or crossprod wont work – user20650 Jan 03 '15 at 22:11
  • I actually have 127,838 rows of data. I have not used "graph" till now - maybe it is time to learn that now. – Viswa Jan 03 '15 at 22:25

4 Answers

# your data
dat <- read.table(text="group_id  user_id
1000       26
1236       29
1236       46
3767       26
3767       46
5614       29
5614       45
5614       46", header=T)

# convert to a group x user incidence matrix
m <- as.matrix(table(dat))

# calculate and reshape
mm <- crossprod(m, m)   # t(m) %*% m: shared-group counts per user pair
r <- reshape2::melt(mm)

# remove zero counts
r[r$value != 0, ]
# user_id user_id value
# 1       26      26     2
# 4       46      26     1
# 6       29      29     2
# 7       45      29     1
# 8       46      29     2
# 10      29      45     1
# 11      45      45     1
# 12      46      45     1
# 13      26      46     1
# 14      29      46     2
# 15      45      46     1
# 16      46      46     3
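
Why this works: `m` is a group × user incidence matrix, so entry (i, j) of `t(m) %*% m` counts the groups that users i and j have in common. A quick sanity check (my addition):

# users 26 and 46 should share exactly one group
m["3767", c("26", "46")]   # group 3767 contains both users
mm["26", "46"]             # 1 shared group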

EDIT: idea from "Network: Making Graph Object from Event-Node Data Using igraph".

# bipartite graph with group and user nodes
g <- graph.data.frame(dat, directed = FALSE)

# mark the group nodes so the two-mode structure is explicit
V(g)$type <- V(g)$name %in% unique(as.character(dat$group_id))

# project onto the user nodes; edge weights count shared groups
b <- bipartite.projection(g)$proj1

ad <- get.adjacency(b, sparse=F, attr="weight")
ad <- ad[sort(colnames(ad)), sort(colnames(ad))]

# the projection has a zero diagonal, so fill in each user's own group count
diag(ad) <- colSums(table(dat))

Then continue as before.
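
For concreteness, a minimal sketch of that continuation (my addition, reusing the reshape2::melt step from the first snippet):

r2 <- reshape2::melt(ad)
r2[r2$value != 0, ]   # same output as the crossprod version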
user20650
  • Yours is a good solution, however, with 127000 rows, the crossprod runs a long time and my RStudio session freezes on a 16GB RAM Mac. Is there a more efficient way to do the crossprod? – Viswa Jan 03 '15 at 22:14
  • Ha, thought it might be a problem - see comment above. Probably a `graph` approach to this. – user20650 Jan 03 '15 at 22:15

What about:

df <- read.table(text="group_id  user_id
1000       26
1236       29
1236       46
3767       26
3767       46
5614       29
5614       45
5614       46", header=T)

# self-join on group_id to pair up users within each group, then drop group_id
df <- merge(df, df, by = "group_id")[,-1]
library(plyr)
# count rows (shared groups) per user pair
ddply(df, .(user_id.x, user_id.y), nrow)

   user_id.x user_id.y V1
1         26        26  2
2         26        46  1
3         29        29  2
4         29        45  1
5         29        46  2
6         45        29  1
7         45        45  1
8         45        46  1
9         46        26  1
10        46        29  2
11        46        45  1
12        46        46  3

EDIT: I was afraid this would be too simple in this case, because the merge explodes with a "large" number of users and a small number of groups. Depending, of course, on the final use, I would consider a graph structure, as already suggested by user20650, and probably keep the data in that form. An undirected weighted graph with a quick lookup for the vertices (user_id) seems like a good solution in many cases; a rough sketch follows below.

I'll leave this simple method in case of a smaller dataset (or just less overlap).
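
For illustration, a rough sketch of what keeping it as a graph could look like (my addition, building on the pair counts computed above; treat it as a starting point rather than a tested solution):

library(igraph)
# one edge per user pair, weighted by the number of shared groups
pairs <- ddply(df, .(user_id.x, user_id.y), nrow)
pairs <- pairs[pairs$user_id.x < pairs$user_id.y, ]  # drop self and duplicate pairs
g <- graph.data.frame(pairs, directed = FALSE)
E(g)$weight <- E(g)$V1
# quick lookup: groups user 46 shares with each other user
w <- get.adjacency(g, sparse = FALSE, attr = "weight")
w["46", w["46", ] > 0]
#  26  29  45
#   1   2   1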

J.R.
  • That's a neat solution! (+1) I would just consider adding the same approach using either dplyr or data.table for better performance. For example in dplyr: `df %>% left_join(., . ,by = "group_id") %>% select(-group_id) %>% count(user_id.x, user_id.y)` – talat Jan 03 '15 at 23:46
  • JR - this does not work as well. I get the error "Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 560022260 rows; more than 127828 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation...; I then added allow.cartesian = TRUE but this returns nothing. – Viswa Jan 04 '15 at 00:32
  • Docendo - what does %>% do? – Viswa Jan 04 '15 at 00:54
  • @Viswa, that is a pipe operator used in the package **dplyr** and some others. To run the code, you have to install and load the package: `install.packages("dplyr"); library(dplyr)`, and then run the code from above. – talat Jan 04 '15 at 01:04

Here's an attempt using the Matrix package - just replicating @nograpes' answer from here:

require(Matrix)
sm = sparseMatrix(dat$group_id, dat$user_id, x = TRUE)
cp = t(sm) %*% sm 
as.data.frame(summary(cp))
#     i  j x
# 1  26 26 2
# 2  46 26 1
# 3  29 29 2
# 4  45 29 1
# 5  46 29 2
# 6  29 45 1
# 7  45 45 1
# 8  46 45 1
# 9  26 46 1
# 10 29 46 2
# 11 45 46 1
# 12 46 46 3
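
If you want readable column names, a small optional tidy-up (my addition; the i/j indices coincide with the user ids because sparseMatrix() was indexed by user_id directly):

res <- as.data.frame(summary(cp))
names(res) <- c("user1", "user2", "common_groups")
res[order(res$user1, res$user2), ]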
Arun

So here are two approaches, one using data.table...

library(data.table)
setkey(setDT(df), group_id)
# self-join on group_id, then count rows per user pair
df[df, allow.cartesian=TRUE][,.N,by=list(user_id,i.user_id)][order(user_id,i.user_id)]
#     user_id i.user_id N
#  1:      26        26 2
#  2:      26        46 1
#  3:      29        29 2
#  4:      29        45 1
#  5:      29        46 2
#  6:      45        29 1
#  7:      45        45 1
#  8:      45        46 1
#  9:      46        26 1
# 10:      46        29 2
# 11:      46        45 1
# 12:      46        46 3

and one using sqldf...

library(sqldf)
sqldf("select a.user_id as user1, b.user_id as user2, count(*) as groups
      from df a inner join df b on a.group_id=b.group_id
      group by 1,2 order by 1,2")
#    user1 user2 groups
# 1     26    26      2
# 2     26    46      1
# 3     29    29      2
# 4     29    45      1
# 5     29    46      2
# 6     45    29      1
# 7     45    45      1
# 8     45    46      1
# 9     46    26      1
# 10    46    29      2
# 11    46    45      1
# 12    46    46      3

The data.table approach is likely to be faster, but your dataset is not very large so it might not make that much of a difference.
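
If you want to check that on your own data, a rough timing sketch (my addition):

# compare the two approaches (df keyed as above)
system.time(
  df[df, allow.cartesian = TRUE][, .N, by = list(user_id, i.user_id)]
)
system.time(
  sqldf("select a.user_id, b.user_id, count(*)
         from df a inner join df b on a.group_id = b.group_id
         group by 1,2")
)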

jlhoward
  • Your data.table solution works but it took ages on my dataset of 127828 rows. – Viswa Jan 04 '15 at 01:29
  • I think [this answer](http://stackoverflow.com/a/26246588/559784) might perform better (for data.table), but the Matrix package seems best fit for this job to me. – Arun Jan 04 '15 at 22:26