Get number of same individuals for different groups

Question

I have a data set with individuals (ID) that can be part of more than one group.

Example:

library(data.table)
DT <- data.table(
  ID = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B",
            "C", "A", "A", "C",
            "A", "B", "C")
)
DT
#     ID Group
#  1:  1     A
#  2:  1     B
#  3:  1     C
#  4:  2     B
#  5:  2     C
#  6:  3     A
#  7:  4     A
#  8:  4     C
#  9:  5     A
# 10:  5     B
# 11:  5     C

I want to know the sum of identical individuals for 2 groups.

The result should look like this:

  Group.1    Group.2    Sum
    A           B        2
    A           C        3
    B           C        3

Where Sum indicates the number of individuals the two groups have in common.

See, also, this [post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses). You can combine that answer with `as.data.frame(as.table())` and `complete.cases`. — alexis_laz, Nov 09 '15 at 17:02
`out <- with(dt, crossprod(table(ID, Group))); out[lower.tri(out, diag=TRUE)] <- NA; na.omit(as.data.frame.table(out))` from the link above — rawr, Nov 09 '15 at 17:12

MichaelChirico · Accepted Answer · 2017-12-23T12:53:56.163

4

Here's my version:

# size-1 IDs can't contribute; skip
DT[ , if (.N > 1) 
  # simplify = FALSE returns a list;
  #   transpose turns the 3-length list of 2-length vectors
  #   into a length-2 list of 3-length vectors (efficiently)
  transpose(combn(Group, 2L, simplify = FALSE)), by = ID
  ][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]

With output:

#    Group.1 Group.2 Sum
# 1:       A       B   2
# 2:       A       C   3
# 3:       B       C   3

edited Dec 23 '17 at 12:53

answered Nov 09 '15 at 16:50

MichaelChirico

33,841
14
113
198

Alternatively, you can use `DT[ , as.data.table(combinat::combn2(unique(Group))), by = ID][, .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]`. `combinat::combn2()` doesn't throw an error in case of `.N < 2` and no `transpose()` is required. – Uwe Dec 23 '17 at 14:49
@Uwe cool :) taking a glance at the [source](https://github.com/cran/combinat/blob/master/R/combn2.R#L4) of `combn2`, I see it just does a `t` internally, so I guess `transpose` will be more efficient :) (ps last edited 1994!? :o ) – MichaelChirico Dec 23 '17 at 15:05
A benchmark comparison would be quite interesting (if not already posted on SO). – Uwe Dec 23 '17 at 15:11
@Uwe was thinking the same, but a way to benchmark didn't jump immediately to mind so i gave up :p – MichaelChirico Dec 23 '17 at 15:22

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

3

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:

library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
  , .N, by = .(x.Group, i.Group)]

   x.Group i.Group N
1:       A       B 2
2:       A       C 3
3:       B       C 3

Explanantion

The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):

DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]

   ID x.Group i.Group
1:  1       A       B
2:  1       A       C
3:  1       B       C
4:  2       B       C
5:  4       A       C
6:  5       A       B
7:  5       A       C
8:  5       B       C

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 11 '17 at 09:59

Uwe

41,420
11
90
134

nice use of self NE join! very SQL :p ; thanks for bringing this Q to my attention, my answer was quite sloppy and I got a chance to improve it :) – MichaelChirico Dec 23 '17 at 13:00
1

@MichaelChirico I' trying to learn SQL through `data.table` ;-) – Uwe Dec 23 '17 at 14:52
well then, your next assignment is to implement my own approach in SQL :p – MichaelChirico Dec 23 '17 at 15:22

score 2 · Answer 3 · edited Dec 23 '17 at 13:16

2

We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.

 library(data.table)#v1.9.6+
 DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][, 
      list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]

#   Group.1 Group.2 Sum
#1:       A       B   2
#2:       A       C   3
#3:       B       C   3

Or as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)

DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N, 
                       by = .(Group.1= Group, Group.2= i.Group)]

edited Dec 23 '17 at 13:16

MichaelChirico

33,841
14
113
198

answered Nov 09 '15 at 16:41

akrun

874,273
37
540
662

Thanks! Exactly what I was looking for! – CC89 Nov 09 '15 at 16:54
1

@akrun I like your approach but I would do the following: first convert `Group` to a `factor` (if it's not already), then simply add `as.integer(Group) – MichaelChirico Nov 09 '15 at 17:02
1

Yeah, I agree with Michael. I'd do `dt[dt, on="ID", allow.cartesian=TRUE][Group < i.Group, .N, by=.(G1=Group,G2=i.Group)]` on the assumption that folks are not listed multiple times in the same group (so `uniqueN` is not needed). – Frank Nov 09 '15 at 17:58

score 1 · Answer 4 · edited Dec 23 '17 at 13:16

yet another solution (base R):

tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
            out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
            c(group1 = ind[1], group2 = ind[2], sum_ = out) 
                }
            )

data.frame(t(ans))

#  group1 group2 sum_
#1      A      B    2
#2      A      C    3
#3      B      C    3

first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....

score 1 · Answer 5 · edited Dec 23 '17 at 13:15

Great answers above. Just an alternative using dplyr in case you, or someone else, is interested.

library(dplyr)

cmb = combn(unique(dt$Group),2)

data.frame(g1 = cmb[1,],
           g2 = cmb[2,]) %>%
  group_by(g1,g2) %>%
  summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
                               DT[DT$Group==g2,]$ID)))

    #       g1     g2     l
    #    (fctr) (fctr) (int)
    # 1      A      B     2
    # 2      A      C     3
    # 3      B      C     3

Get number of same individuals for different groups

5 Answers5

Explanantion