How to identify and tally intersection items in R

Question

I have a data frame which shows membership in three color classes. Numbers refer to unique IDs. One ID may be a part of one group or multiple groups.

dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

or for visual reference:

BLUE  RED  GREEN
1     2    4
2     3    6
3     6    8
4     7    9
6     9    10
NA    13   11

I need to identify and tally individual and cross group membership (i.e. how many IDs were only in red, how many were in both red and blue, etc.) My desired output is below. Please note that the IDs column is simply for reference, that column would not be in the expected output.

COLOR                TOTAL  IDs (reference only, not needed in final output)
RED                  2      (7, 13)
BLUE                 1      (1)
GREEN                3      (8, 10, 11)
RED, BLUE            3      (2, 3, 6)
RED, GREEN           2      (6, 9)
BLUE, GREEN          2      (4, 6)
RED, BLUE, GREEN     1      (6)

Does anyone know an efficient way to do this in R? Thanks!

The number 2 refers to the number of IDs that are in red and only red. Since 7 and 13 are in red, but not in blue or green, the total for red is 2. All other numbers in red appear in other groups. 7 and 13 are just random numbers that I created for illustrative purposes, but do act the same way that the IDs appear in the real data set (i.e. some are in only one group, and some are in multiple groups) — DJC, Sep 26 '19 at 15:39
Why is 6 listed in `RED, BLUE, GREEN`, as well as all 3 pairwise groups? I interpreted this as you only want each ID to be counted once (it its maximal group). — ClancyStats, Sep 26 '19 at 15:41
Not sure if I understand question here, but 6 is listed in red, blue, green row individually, as I need to be able to identify how many IDs were in all three groups at once. — DJC, Sep 26 '19 at 15:45

tmfmnk · Answer 1 · 2019-09-26T17:10:28.080

You can use the venn library (especially suited for situations when you do not have NAs in your data):

venn_table <- venn(as.list(dat))

               BLUE RED GREEN counts
                  0   0     0      0
GREEN             0   0     1      3
RED               0   1     0      2
RED:GREEN         0   1     1      1
BLUE              1   0     0      2
BLUE:GREEN        1   0     1      1
BLUE:RED          1   1     0      2
BLUE:RED:GREEN    1   1     1      1

And:

attr(venn_table, "intersections")

$GREEN
[1]  8 10 11

$RED
[1]  7 13

$`RED:GREEN`
[1] 9

$BLUE
[1]  1 NA

$`BLUE:GREEN`
[1] 4

$`BLUE:RED`
[1] 2 3

$`BLUE:RED:GREEN`
[1] 6

To include also the IDs:

data.frame(venn_table[2:nrow(venn_table), ],
           ID = do.call("rbind", lapply(attr(venn_table, "intersections"), paste0, collapse = ",")))

               BLUE RED GREEN counts      ID
GREEN             0   0     1      3 8,10,11
RED               0   1     0      2    7,13
RED:GREEN         0   1     1      1       9
BLUE              1   0     0      2    1,NA
BLUE:GREEN        1   0     1      1       4
BLUE:RED          1   1     0      2     2,3
BLUE:RED:GREEN    1   1     1      1       6

One way to deal with the the NAs:

venn_table2 <- data.frame(venn_table[2:nrow(venn_table), length(venn_table), drop = FALSE],
                          ID = do.call("rbind", lapply(attr(venn_table, "intersections"), paste0, collapse = ",")))

counts <- venn_table2[1] - with(venn_table2, lengths(regmatches(ID, gregexpr("NA", ID))))

               counts
GREEN               3
RED                 2
RED:GREEN           1
BLUE                1
BLUE:GREEN          1
BLUE:RED            2
BLUE:RED:GREEN      1

And a more elegant way to deal with the NAs could be (based on a comment from @M--):

print(venn(Map(function(x) x[!is.na(x)], as.list(dat))))

               BLUE RED GREEN counts
                  0   0     0      0
GREEN             0   0     1      3
RED               0   1     0      2
RED:GREEN         0   1     1      1
BLUE              1   0     0      1
BLUE:GREEN        1   0     1      1
BLUE:RED          1   1     0      2
BLUE:RED:GREEN    1   1     1      1

@M, thanks for pointing that out. I actually didn't need the IDs column in the eventual output so no worries on that. This method is much more simple, but I am noticing the it is counting the NA in blue as a unique value (thus the blue total is 2, when it should be 1), any idea how to fix that? — DJC, Sep 26 '19 at 16:19

M-- · Accepted Answer · 2019-09-26T16:49:30.317

2

library(dplyr)
library(tidyr)

cbind(dat, row = 1:6) %>% 
  gather(COLOR, IDs, -row) %>% 
  group_by(IDs) %>% 
  nest(COLOR, .key="COLOR") %>% 
  mutate(COLOR = sapply(COLOR, as.character)) %>% 
  drop_na %>% 
  group_by(COLOR) %>% 
  add_count(name="TOTAL") %>% 
  group_by(COLOR, TOTAL) %>% 
  nest(IDs, .key = "IDs") %>% 
  as.data.frame

#>                       COLOR TOTAL       IDs
#> 1                      BLUE     1         1
#> 2          c("BLUE", "RED")     2      2, 3
#> 3        c("BLUE", "GREEN")     1         4
#> 4 c("BLUE", "RED", "GREEN")     1         6
#> 5                       RED     2     7, 13
#> 6         c("RED", "GREEN")     1         9
#> 7                     GREEN     3 8, 10, 11

There's a more conventional method to deal with NA in venn package:

library(purrr)
library(magrittr)
library(venn)

as.list(dat) %>%
  map(discard, is.na) %>%
  compact() %>% 
  venn() %>% 
  print

    #>                BLUE RED GREEN counts
    #>                   0   0     0      0
    #> GREEN             0   0     1      3
    #> RED               0   1     0      2
    #> RED:GREEN         0   1     1      1
    #> BLUE              1   0     0      1
    #> BLUE:GREEN        1   0     1      1
    #> BLUE:RED          1   1     0      2
    #> BLUE:RED:GREEN    1   1     1      1

There are many other packages for venn diagram in R according to this answer.

For instance, VennDiagram::venn.diagram package has an na variable which gets stop, remove, and none. So, here we would use remove; however, it will only give us the diagram and not the table. You can explore other possibilities in other packages.

edited Sep 26 '19 at 16:49

answered Sep 26 '19 at 16:11

M--

25,431
8
61
93

Agreed. Not cool whoever did that. Sorry all - I have a follow-up to both @M and @tmfmnk solutions. On both of your solutions, combinations like RED:GREEN = 1, or RED:BLUE = 2, but shouldn't those be returning as 2 and 3 respectively? Maybe I'm going crazy, but re-did the counts and that's how they should be returning no? – DJC Sep 26 '19 at 17:16
1

@DJC No, because `6` is in `Red, Green, Blue`. You are counting it more than once. If we want to follow your logic, beside `Red, Green = 2` and `Red, Blue = 3`, `Blue, Green` also should be `2` and not `1`. See the graph that I have above for clarification. p.s. you can only tag one person per comment. *tmfmnk* has not been notified on this. – M-- Sep 26 '19 at 17:22
Thanks for the heads up. Is there anyway to count it in the way that I'm describing? Even though I know that logically, if something is in RBG, then by extension it is in RB, RG, and BG, I would still need those two-pair counts regardless. Is the easiest way to just take that blanket combo total across all groups and simply add it to all other combination groups, but not the individual colors? Thank you so so so much by the way. Really appreciate both of your help with this – DJC Sep 26 '19 at 17:27
1

@DJC need some time to think about it. But I think I can hack `tidyverse` solution to make it work as you wish. – M-- Sep 26 '19 at 17:39
2

I'm sure that it could be done with `tidyverse`, however, I don't think it can be meaningfully addressed through a Venn diagram approach. – tmfmnk Sep 26 '19 at 17:44
Thanks. Right now I've modified your code as shown below. My approach is to check if there are two or more colons in COLOR. If there are, I want to index that total, and then add that number to any COLOR with a single colon ":". This might be tricky given that in reality, there are 5 groups, but we'll see haha – DJC Sep 26 '19 at 17:45
dat %>% gather(COLOR, IDs) %>% nest(COLOR, .key="COLOR") %>% mutate(COLOR = sapply(COLOR, as.character), COLOR = str_replace_all(COLOR, "[^[:alnum:]]", " "), COLOR = gsub("c ", "", COLOR) %>% str_trim, COLOR = gsub(" ", ":", COLOR)) %>% drop_na %>% group_by(COLOR) %>% add_count(name="TOTAL") %>% distinct(COLOR, TOTAL) %>% mutate(TEST = str_detect(COLOR, ":")) – DJC Sep 26 '19 at 17:45
@DJC question. If you want rgb in rg, bg, rb then do you want all of these in r and b and g as well making them to be count of 6 (or 5 for blue as it has one NA). I am trying to say that your logic is inconsistent. – M-- Sep 26 '19 at 17:48
I don't think so. I only want this to apply to combinations, not individual categories, hence my looking for the presence of a semi-colon, indicating multiple categories. Maybe I've just confused myself :/ – DJC Sep 26 '19 at 17:51
@DJC I wasn't talking about implementation of it. I was saying it is conceptually flawed as you are counting interactions twice but excluding them from the individuals. Anyhow, happy that it worked for you after all. – M-- Sep 26 '19 at 23:28

How to identify and tally intersection items in R

2 Answers2