How to count overlaps among the members of dplyr::group_by

Question

I have the following tibble:

library(tidyverse)
df <- tibble::tribble(
  ~gene, ~celltype,
  "a",   "cel1_1",  
  "b",   "cel1_1",  
  "c",   "cel1_1",  
  "a",   "cell_2",  
  "b",   "cell_2",  
  "c",   "cell_3",  
  "d",   "cell_3"
)

df %>% group_by(celltype)
#> Source: local data frame [7 x 2]
#> Groups: celltype [3]
#> 
#> # A tibble: 7 x 2
#>    gene celltype
#>   <chr>    <chr>
#> 1     a   cel1_1
#> 2     b   cel1_1
#> 3     c   cel1_1
#> 4     a   cell_2
#> 5     b   cell_2
#> 6     c   cell_3
#> 7     d   cell_3

The genes can be grouped by cell type as follows:

 cell1   a,b,c
 cell2   a,b
 cell3   c,d

What I want to do is calculate the gene overlap for every pair of cell types, resulting in this table:

          cell1    cell2     cell3
 cell1    3          2          1 
 cell2    2          2          0
 cell3    1          0          2

How can I achieve that?

Update

Finally, I want to convert the counts to proportions, dividing each overlap by the larger of the two group sizes in the pair:

          cell1          cell2          cell3
 cell1    1.00 (3/3)     0.67 (2/3)     0.33 (1/3)
 cell2    0.67 (2/3)     1.00           0
 cell3    0.33 (1/3)     0              1.00

I tried this, but it doesn't give what I want:

> tmp <- crossprod(table(df))
> tmp/max(tmp)
        celltype
celltype    cel1_1    cell_2    cell_3
  cel1_1 1.0000000 0.6666667 0.3333333
  cell_2 0.6666667 0.6666667 0.0000000
  cell_3 0.3333333 0.0000000 0.6666667

The diagonal should always have the value 1.00.

If I understand, `res <- tmp/max(tmp); diag(res) <- 1` – akrun May 29 '17 at 05:56

akrun · Accepted Answer · 2017-05-29T04:53:14.227

We can use table with crossprod:

crossprod(table(df))
#        celltype
#celltype cel1_1 cell_2 cell_3
#  cel1_1      3      2      1
#  cell_2      2      2      0
#  cell_3      1      0      2
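
To see why this works, here is a minimal sketch (not part of the original answer) that compares the crossprod result with an explicit pairwise intersection count. It assumes each gene appears at most once per cell type, so that table(df) is a 0/1 incidence matrix; the names by_crossprod, genes and by_intersect are only illustrative.

```r
# Rebuild the example data with base R only (same values as the question's tibble).
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1", "cell_2", "cell_2", "cell_3", "cell_3"),
  stringsAsFactors = FALSE
)

# table(df) is a gene x celltype incidence matrix; crossprod(m) computes
# t(m) %*% m, which counts the genes shared by each pair of cell types.
by_crossprod <- crossprod(table(df))

# Explicit check: split the genes by cell type and count pairwise intersections.
genes <- split(df$gene, df$celltype)
by_intersect <- sapply(genes, function(x) {
  sapply(genes, function(y) length(intersect(x, y)))
})

# Both approaches should agree entry by entry.
all(unname(by_crossprod) == unname(by_intersect))
```

If a gene could be listed more than once for the same cell type, the table would hold counts above 1 and the two results would no longer match, so deduplicating first would be needed in that case.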

Another option is the tidyverse:

library(tidyverse)
count(df, gene, celltype) %>% 
       spread(celltype, n, fill = 0) %>%
       select(-gene) %>% 
       as.matrix %>% 
       crossprod
#        cel1_1 cell_2 cell_3
#cel1_1      3      2      1
#cell_2      2      2      0
#cell_3      1      0      2

Or with data.table:

library(data.table)
crossprod(as.matrix(dcast(setDT(df), gene~celltype, length)[,-1]))
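
For the percentage requested in the question's update, one possible sketch (an assumption about the intended normalisation, not something the answer itself shows) divides each overlap by the larger of the two group sizes in the pair. The group sizes sit on the diagonal of the crossprod result; the names sizes and res are illustrative.

```r
# Rebuild the example data with base R only (same values as the question's tibble).
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1", "cell_2", "cell_2", "cell_3", "cell_3"),
  stringsAsFactors = FALSE
)

# Pairwise overlap counts, as in the answer.
tmp <- crossprod(table(df))

# The diagonal holds the number of genes in each cell type.
sizes <- diag(tmp)

# outer(..., pmax) builds a matrix whose [i, j] entry is max(sizes[i], sizes[j]),
# so each overlap is divided by the larger group in its pair.
res <- tmp / outer(sizes, sizes, pmax)

round(res, 2)
```

With this normalisation the diagonal is 1 by construction, because each group is divided by its own size, so no separate `diag(res) <- 1` step is needed. Unlike `tmp/max(tmp)`, the denominator changes from pair to pair, which only matters when the largest group is not part of the pair.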

edited May 29 '17 at 04:53

answered May 29 '17 at 04:43

akrun


not quite. I'm looking for overlap for example cell1 (a,b,c) cell2 (c,d) the overlap is (c) so the value in cell1,cell3 is 1. – pdubois May 29 '17 at 04:45
@pdubois I get the expected output with this – akrun May 29 '17 at 04:47
you're right. there is a bug in my OP. I fixed it. – pdubois May 29 '17 at 04:48
I got an error with a larger df `> dim(df) [1] 370539 5`. The message is `Error in table(df) : attempt to make a table with >= 2^31 elements`. Wait. I'll provide a link to the full file. – pdubois May 29 '17 at 04:51
@pdubois How big is your dataset? I updated with a tidyverse option and data.table option. – akrun May 29 '17 at 04:53
sorry my bad. your code works fine with my data. – pdubois May 29 '17 at 04:55
I forgot to assign the dplyr pipe sequence to a variable `df`. – pdubois May 29 '17 at 04:58
Would you mind looking at my update? I need to calculate the percentage. – pdubois May 29 '17 at 05:05
@pdubois: simply assign it to tmp, then `tmp <- tmp/max(tmp)` – smci May 29 '17 at 05:11
@smci: I tried. The diagonal is not 1.00. See my update. – pdubois May 29 '17 at 05:14
@pdubois. Akrun has added the answer to that too, not in the answer, but in a comment on your question. This is like hide-and-seek ;-) What have we got against being linear... – smci May 29 '17 at 06:24
