4

I have a data set with individuals (ID) that can be part of more than one group.

Example:

library(data.table)
DT <- data.table(
  ID = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B",
            "C", "A", "A", "C",
            "A", "B", "C")
)
DT
#     ID Group
#  1:  1     A
#  2:  1     B
#  3:  1     C
#  4:  2     B
#  5:  2     C
#  6:  3     A
#  7:  4     A
#  8:  4     C
#  9:  5     A
# 10:  5     B
# 11:  5     C

I want to know the sum of identical individuals for 2 groups.

The result should look like this:

  Group.1    Group.2    Sum
    A           B        2
    A           C        3
    B           C        3

Where Sum indicates the number of individuals the two groups have in common.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
CC89
  • 103
  • 4
  • 1
    See, also, this [post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses). You can combine that answer with `as.data.frame(as.table())` and `complete.cases`. – alexis_laz Nov 09 '15 at 17:02
  • 1
    `out <- with(dt, crossprod(table(ID, Group))); out[lower.tri(out, diag=TRUE)] <- NA; na.omit(as.data.frame.table(out))` from the link above – rawr Nov 09 '15 at 17:12
  • @ alexis_laz and @rawr also nice solution! Thanks. – CC89 Nov 12 '15 at 11:05

5 Answers5

4

Here's my version:

# size-1 IDs can't contribute; skip
DT[ , if (.N > 1) 
  # simplify = FALSE returns a list;
  #   transpose turns the 3-length list of 2-length vectors
  #   into a length-2 list of 3-length vectors (efficiently)
  transpose(combn(Group, 2L, simplify = FALSE)), by = ID
  ][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]

With output:

#    Group.1 Group.2 Sum
# 1:       A       B   2
# 2:       A       C   3
# 3:       B       C   3
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • Alternatively, you can use `DT[ , as.data.table(combinat::combn2(unique(Group))), by = ID][, .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]`. `combinat::combn2()` doesn't throw an error in case of `.N < 2` and no `transpose()` is required. – Uwe Dec 23 '17 at 14:49
  • @Uwe cool :) taking a glance at the [source](https://github.com/cran/combinat/blob/master/R/combn2.R#L4) of `combn2`, I see it just does a `t` internally, so I guess `transpose` will be more efficient :) (ps last edited 1994!? :o ) – MichaelChirico Dec 23 '17 at 15:05
  • A benchmark comparison would be quite interesting (if not already posted on SO). – Uwe Dec 23 '17 at 15:11
  • @Uwe was thinking the same, but a way to benchmark didn't jump immediately to mind so i gave up :p – MichaelChirico Dec 23 '17 at 15:22
3

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:

library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
  , .N, by = .(x.Group, i.Group)]
   x.Group i.Group N
1:       A       B 2
2:       A       C 3
3:       B       C 3

Explanantion

The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):

DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
   ID x.Group i.Group
1:  1       A       B
2:  1       A       C
3:  1       B       C
4:  2       B       C
5:  4       A       C
6:  5       A       B
7:  5       A       C
8:  5       B       C
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
2

We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.

 library(data.table)#v1.9.6+
 DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][, 
      list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]

#   Group.1 Group.2 Sum
#1:       A       B   2
#2:       A       C   3
#3:       B       C   3

Or as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)

DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N, 
                       by = .(Group.1= Group, Group.2= i.Group)] 
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! Exactly what I was looking for! – CC89 Nov 09 '15 at 16:54
  • 1
    @akrun I like your approach but I would do the following: first convert `Group` to a `factor` (if it's not already), then simply add `as.integer(Group) – MichaelChirico Nov 09 '15 at 17:02
  • 1
    Yeah, I agree with Michael. I'd do `dt[dt, on="ID", allow.cartesian=TRUE][Group < i.Group, .N, by=.(G1=Group,G2=i.Group)]` on the assumption that folks are not listed multiple times in the same group (so `uniqueN` is not needed). – Frank Nov 09 '15 at 17:58
1

yet another solution (base R):

tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
            out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
            c(group1 = ind[1], group2 = ind[2], sum_ = out) 
                }
            )

data.frame(t(ans))

#  group1 group2 sum_
#1      A      B    2
#2      A      C    3
#3      B      C    3

first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
stas g
  • 1,503
  • 2
  • 10
  • 20
1

Great answers above. Just an alternative using dplyr in case you, or someone else, is interested.

library(dplyr)

cmb = combn(unique(dt$Group),2)

data.frame(g1 = cmb[1,],
           g2 = cmb[2,]) %>%
  group_by(g1,g2) %>%
  summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
                               DT[DT$Group==g2,]$ID)))

    #       g1     g2     l
    #    (fctr) (fctr) (int)
    # 1      A      B     2
    # 2      A      C     3
    # 3      B      C     3
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
AntoniosK
  • 15,991
  • 2
  • 19
  • 32