
I have the following dataset showing the INGREDIENTS contained in each PRODUCT:

data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"), 
            "INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))

I want to know the most common combinations of INGREDIENTS in each PRODUCT: which ingredient is associated with which other ingredient? I applied the code I found in this thread:

library(dplyr)

combinaisons_par_PRODUCT = data %>% 
  full_join(data, by="PRODUCT") %>% 
  group_by(INGREDIENT.x, INGREDIENT.y) %>% 
  summarise(n = length(unique(PRODUCT))) %>% 
  filter(INGREDIENT.x!=INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep=", "))

It works, but there is one final flaw: I would like the order to be ignored. For instance, this code would give me 1 association of HAHA and SASA, and also 1 association of SASA and HAHA. But to me these are the same thing, so I would like the code to ignore the order of INGREDIENTS and give me a single association for HAHA & SASA.

I tried sorting the INGREDIENTS before applying the code, but that didn't work either. Could someone help me, please? How can I get these combinations regardless of order?

Thank you very much!

Siva Kg

3 Answers


Does this do what you want? I'm keeping only the rows where the combo is in alphabetical order, which avoids double counting.

data %>% 
  full_join(data, by="PRODUCT") %>%
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
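
Alternatively, if you want to normalize the pair itself rather than filter (closer to the sorting idea mentioned in the question), a sketch using base R's vectorised `pmin`/`pmax` on the two ingredient columns:

```r
library(dplyr)

# Sketch of an alternative: rewrite each pair so the alphabetically
# smaller ingredient always comes first, then count distinct products.
combos <- data %>%
  full_join(data, by = "PRODUCT") %>%
  filter(INGREDIENT.x != INGREDIENT.y) %>%
  mutate(
    a = pmin(INGREDIENT.x, INGREDIENT.y),  # alphabetically first
    b = pmax(INGREDIENT.x, INGREDIENT.y)   # alphabetically second
  ) %>%
  distinct(PRODUCT, a, b) %>%              # drop the mirrored duplicate
  count(combo = paste(a, b, sep = ", "))
```

The `distinct(PRODUCT, a, b)` step is what collapses the `(x, y)` / `(y, x)` mirror rows, since both map to the same `(a, b)` after `pmin`/`pmax`.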
Jon Spring

An igraph option using `graph_from_adjacency_matrix`:

library(igraph)

get.data.frame(
    graph_from_adjacency_matrix(
        crossprod(table(data)),
        mode = "undirected",
        weighted = TRUE
    )
)

gives

     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5
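
For intuition: `table(data)` is the PRODUCT × INGREDIENT incidence table, and `crossprod` of it counts, for each pair of ingredients, how many products contain both. A quick check against the output above:

```r
tab <- table(data)     # PRODUCT x INGREDIENT incidence table
co  <- crossprod(tab)  # INGREDIENT x INGREDIENT co-occurrence counts

co["zaza", "zuzu"]     # 4: together in Creme, Medoc, Hububu and Troll
co["zuzu", "zuzu"]     # 5: the diagonal counts products per ingredient
```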
ThomasIsCoding
  • Thank you very much! I tried but I get the error "Error in get.data.frame(graph_from_adjacency_matrix(crossprod(table(data)), : cannot find "get.data.frame" – Siva Kg Jun 19 '21 at 21:45
  • @SivaKg Sorry, my bad. You should add `library(igraph)` in front of my code. See my update. – ThomasIsCoding Jun 19 '21 at 21:47

We could use base R:

m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)

EDIT: Updated based on comments from @ThomasIsCoding.

akrun
  • Good idea with `crossprod`, upvoted! By the way, maybe `mat <- crossprod(table(data)); mat[upper.tri(mat)] <- 0; out <- subset(as.data.frame.table(mat), Freq != 0)` is the code OP wants, since `HAHA & SASA` and `SASA & HAHA` are isomorphic – ThomasIsCoding Jun 19 '21 at 20:03
  • @ThomasIsCoding thank you. I was not sure what the expected output was. Updated the post – akrun Jun 19 '21 at 20:06
  • This is indeed a great idea and I like the representation as a crosstable. But this was mock data. My real data has 4 million lines and thousands of unique ingredients, so R doesn't like it :( but thank you very much! – Siva Kg Jun 19 '21 at 21:43
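
For the scale mentioned in the last comment, the dense `table(data)` crosstab is likely the bottleneck. A sketch of the same `crossprod` idea on a sparse incidence matrix, using the Matrix package (and assuming each ingredient is listed at most once per product):

```r
library(Matrix)

# Build a sparse PRODUCT x INGREDIENT incidence matrix instead of a
# dense contingency table.
prod_f <- factor(data$PRODUCT)
ing_f  <- factor(data$INGREDIENT)
m <- sparseMatrix(i = as.integer(prod_f),
                  j = as.integer(ing_f),
                  x = 1,
                  dimnames = list(levels(prod_f), levels(ing_f)))

co <- crossprod(m)  # sparse INGREDIENT x INGREDIENT co-occurrence
co <- tril(co)      # keep one orientation of each pair (plus diagonal)

# Triplet form of the non-zero entries: one row per ingredient pair
trip <- summary(co)
out  <- data.frame(from   = colnames(co)[trip$i],
                   to     = colnames(co)[trip$j],
                   weight = trip$x)
```

Because the matrix stays sparse throughout, memory grows with the number of observed pairs rather than with (number of ingredients)², which should behave much better with thousands of unique ingredients.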