I have data like this, where buckets can have different numbers of items:

Bucket A | Item 1
Bucket A | Item 2
Bucket A | Item 3
Bucket B | Item 3
Bucket B | Item 4
Bucket C | Item 1
Bucket C | Item 5
Bucket C | Item 2

I want to compute the pairwise item overlap between all buckets, in the following format (with the base bucket on the left, so each row shows the share of that bucket's items that also appear in the other bucket):

         | Bucket A | Bucket B | Bucket C
Bucket A |   100%   |   33%    |   66%
Bucket B |   50%    |   100%   |   0%
Bucket C |   66%    |   0%     |   100%
NBC

1 Answer

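Here is one way using dplyr. For each item, pair up every bucket that contains it, tabulate the pairs to count the items each pair of buckets shares, then divide each row by its diagonal entry (the size of the base bucket):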

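library(dplyr)

# For each item (V2), build every ordered pair of buckets (V1) that contain it
temp <- df %>%
  group_by(V2) %>%
  do(expand.grid(.$V1, .$V1, stringsAsFactors=FALSE)) %>%
  ungroup() %>%
  select(Var1, Var2) %>%
  table()  # number of shared items for each bucket pair
# Divide each row by its diagonal entry, i.e. the size of the base bucket
temp / diag(temp)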

           Var2
Var1        Bucket A  Bucket B  Bucket C 
  Bucket A  1.0000000 0.3333333 0.6666667
  Bucket B  0.5000000 1.0000000 0.0000000
  Bucket C  0.6666667 0.0000000 1.0000000
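
The same shared-item counts can probably also be obtained in base R without dplyr. The following is only a sketch, not part of the approach above: it assumes each bucket/item pair appears at most once in the data (otherwise run `unique(df)` first), and the names `incidence`, `shared` and `overlap` are made up for illustration. The data is re-created with the whitespace trimmed so the block runs on its own.

```r
# Same data as in the question, with surrounding whitespace trimmed
df <- data.frame(
  V1 = c("Bucket A", "Bucket A", "Bucket A", "Bucket B", "Bucket B",
         "Bucket C", "Bucket C", "Bucket C"),
  V2 = c("Item 1", "Item 2", "Item 3", "Item 3", "Item 4",
         "Item 1", "Item 5", "Item 2"),
  stringsAsFactors = FALSE
)

# Item-by-bucket incidence table: 1 if the bucket contains the item, else 0
# (assumes no duplicated bucket/item rows)
incidence <- table(df$V2, df$V1)

# t(incidence) %*% incidence: entry [i, j] is the number of items
# shared by bucket i and bucket j; the diagonal holds the bucket sizes
shared <- crossprod(incidence)

# Divide each row by the size of its base bucket (vector recycling runs
# down the columns, so row i is divided by diag(shared)[i])
overlap <- shared / diag(shared)

# Express as whole percentages
round(100 * overlap)
```

Here the row for Bucket B against Bucket A should come out as 50, since one of Bucket B's two items (Item 3) also appears in Bucket A.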

Data

df <- structure(list(V1 = c("Bucket A ", "Bucket A ", "Bucket A ", 
"Bucket B ", "Bucket B ", "Bucket C ", "Bucket C ", "Bucket C "
), V2 = c(" Item 1", " Item 2", " Item 3", " Item 3", " Item 4", 
" Item 1", " Item 5", " Item 2")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, 
-8L))
JasonWang