How to find bucket overlap based on common items in R

Question

I have data like this, where buckets can have different numbers of items:

Bucket A | Item 1
Bucket A | Item 2
Bucket A | Item 3
Bucket B | Item 3
Bucket B | Item 4
Bucket C | Item 1
Bucket C | Item 5
Bucket C | Item 2

I want to find the item overlap between every pair of buckets and get the result in the following format (with the base bucket on the left):

         | Bucket A | Bucket B | Bucket C
Bucket A |   100%   |   33%    |   66%
Bucket B |   50%    |   100%   |   0%
Bucket C |   66%    |   0%     |   100%

Can you define 'bucket overlap', or make your example result match your example data? — thelatemail, Mar 15 '17 at 23:28
See [this post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses) -- here, `tab = table(dat); (tcrossprod(tab) / rowSums(tab)) * 100` — alexis_laz, Mar 16 '17 at 11:01
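
The base R one-liner in the comment above can be unpacked step by step. This is a minimal sketch, assuming the data sits in a data frame `df` with columns `V1` (bucket) and `V2` (item) and that each bucket/item pair appears at most once; the names `tab`, `shared` and `overlap_pct` are illustrative only.

```r
# Same rows as in the question; the column names V1/V2 are an assumption
df <- data.frame(
  V1 = c("Bucket A", "Bucket A", "Bucket A", "Bucket B", "Bucket B",
         "Bucket C", "Bucket C", "Bucket C"),
  V2 = c("Item 1", "Item 2", "Item 3", "Item 3", "Item 4",
         "Item 1", "Item 5", "Item 2"),
  stringsAsFactors = FALSE
)

# Bucket-by-item incidence table: 1 where the bucket contains the item
tab <- table(df$V1, df$V2)

# tcrossprod(tab) is tab %*% t(tab): entry [i, j] counts the items
# shared by bucket i and bucket j (the diagonal is each bucket's size)
shared <- tcrossprod(tab)

# rowSums(tab) is the item count per bucket; the vector is recycled down
# each column, so row i is divided by the size of bucket i
overlap_pct <- shared / rowSums(tab) * 100
round(overlap_pct, 1)
```

Rows are the base bucket, so the matrix is not symmetric: Bucket A shares one of its three items with Bucket B, while Bucket B shares one of its two items with Bucket A.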

JasonWang · Accepted Answer · 2017-03-16T00:05:29.090

Here is a way using dplyr:

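library(dplyr)

# For each item, pair every bucket containing it with every bucket
# containing it (itself included), then count the pairs per bucket combination
temp <- df %>%
  group_by(V2) %>%
  do(expand.grid(.$V1, .$V1, stringsAsFactors=FALSE)) %>%
  ungroup() %>%
  select(Var1, Var2) %>%
  table()
# Divide each row by its diagonal entry (the base bucket's item count)
temp / diag(temp)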

           Var2
Var1        Bucket A  Bucket B  Bucket C 
  Bucket A  1.0000000 0.3333333 0.6666667
  Bucket B  0.5000000 1.0000000 0.0000000
  Bucket C  0.6666667 0.0000000 1.0000000
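
Dividing by `diag(temp)` works because R recycles the vector down each column, so every row is divided by its own diagonal entry. The sketch below uses a hand-built count matrix (the name `counts` and its values are assumptions chosen to mirror the example data) to compare this with an explicit `sweep()` call, and then formats the result as percentages.

```r
# Hypothetical shared-item counts mirroring the example data;
# the diagonal holds each bucket's own item count
buckets <- c("Bucket A", "Bucket B", "Bucket C")
counts <- matrix(c(3, 1, 2,
                   1, 2, 0,
                   2, 0, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(base = buckets, other = buckets))

# Implicit form, as in the answer: the vector is recycled down each column
implicit <- counts / diag(counts)

# Explicit form: sweep() divides each row (MARGIN = 1) by its diagonal entry
explicit <- sweep(counts, 1, diag(counts), "/")

# Both forms give the same matrix
stopifnot(isTRUE(all.equal(implicit, explicit)))

# Optional: whole percentages as text; round() yields 67% for 2/3,
# where the table in the question shows 66%
pct <- round(100 * explicit)
pct[] <- paste0(pct, "%")
pct
```

The `sweep()` form is more verbose, but it states the direction of the division explicitly, which may help when the matrix is not square or the recycling rule is easy to misread.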

Data

df <- structure(list(V1 = c("Bucket A ", "Bucket A ", "Bucket A ", 
"Bucket B ", "Bucket B ", "Bucket C ", "Bucket C ", "Bucket C "
), V2 = c(" Item 1", " Item 2", " Item 3", " Item 3", " Item 4", 
" Item 1", " Item 5", " Item 2")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, 
-8L))

You have an extra parenthesis in there, but otherwise it works. Thank you! — NBC, Mar 16 '17 at 00:04
