
I have the following dataset showing the INGREDIENTS contained in each PRODUCT:

data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"), 
            "INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))

I want to know the most common combinations of INGREDIENTS in each PRODUCT: which ingredient is associated with which other ingredient? I applied the code I found in this thread:

library(dplyr)

combinaisons_par_PRODUCT = data %>% 
  full_join(data, by="PRODUCT") %>% 
  group_by(INGREDIENT.x, INGREDIENT.y) %>% 
  summarise(n = length(unique(PRODUCT))) %>% 
  filter(INGREDIENT.x!=INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep=", "))

It works, but there is one final flaw: I would like the order to be ignored. For instance, this code would give me 1 association of HAHA and SASA, and also 1 association of SASA and HAHA. But to me these are the same thing, so I would like the code to ignore the order of INGREDIENTS and give me a single association for HAHA & SASA.

I tried sorting the INGREDIENTS before applying the code, but that didn't work either. Could someone help me, please? How can I get these combinations regardless of order?

Thank you very much!

Siva Kg

3 Answers


Does this do what you want? I'm keeping only the rows where the combo is in alphabetical order, which avoids double counting.

data %>% 
  full_join(data, by="PRODUCT") %>%
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
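
Alternatively, if you want to normalize the pair itself rather than filter (closer to the sorting idea mentioned in the question), a sketch using base R's vectorised `pmin`/`pmax` on the two ingredient columns:

```r
library(dplyr)

# Sketch of an alternative: rewrite each pair so the alphabetically
# smaller ingredient always comes first, then count distinct products.
combos <- data %>%
  full_join(data, by = "PRODUCT") %>%
  filter(INGREDIENT.x != INGREDIENT.y) %>%
  mutate(
    a = pmin(INGREDIENT.x, INGREDIENT.y),  # alphabetically first
    b = pmax(INGREDIENT.x, INGREDIENT.y)   # alphabetically second
  ) %>%
  distinct(PRODUCT, a, b) %>%              # drop the mirrored duplicate
  count(combo = paste(a, b, sep = ", "))
```

The `distinct(PRODUCT, a, b)` step is what collapses the `(x, y)` / `(y, x)` mirror rows, since both map to the same `(a, b)` after `pmin`/`pmax`.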
Jon Spring

An igraph option using `graph_from_adjacency_matrix`:

library(igraph)

get.data.frame(
    graph_from_adjacency_matrix(
        crossprod(table(data)),
        mode = "undirected",
        weighted = TRUE
    )
)

gives

     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5
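
For intuition: `table(data)` is the PRODUCT × INGREDIENT incidence table, and `crossprod` of it counts, for each pair of ingredients, how many products contain both. A quick check against the output above:

```r
tab <- table(data)     # PRODUCT x INGREDIENT incidence table
co  <- crossprod(tab)  # INGREDIENT x INGREDIENT co-occurrence counts

co["zaza", "zuzu"]     # 4: together in Creme, Medoc, Hububu and Troll
co["zuzu", "zuzu"]     # 5: the diagonal counts products per ingredient
```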
ThomasIsCoding
  • Thank you very much! I tried but I get the error "Error in get.data.frame(graph_from_adjacency_matrix(crossprod(table(data)), : cannot find "get.data.frame" – Siva Kg Jun 19 '21 at 21:45
  • @SivaKg Sorry, my bad. You should add `library(igraph)` in front of my code. See my update. – ThomasIsCoding Jun 19 '21 at 21:47

We could use base R:

m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)

EDIT: Updated based on comments from @ThomasIsCoding.

akrun
  • Good idea with `crossprod`, upvoted! By the way, maybe `mat <- crossprod(table(data)); mat[upper.tri(mat)] <- 0; out <- subset(as.data.frame.table(mat), Freq != 0)` is the code OP wants, since `HAHA & SASA` and `SASA & HAHA` are isomorphic – ThomasIsCoding Jun 19 '21 at 20:03
  • @ThomasIsCoding thank you. I was not sure what the expected output was. Updated the post – akrun Jun 19 '21 at 20:06
  • This is indeed a great idea and I like the representation as a crosstable. But this was mock data. My real data has 4 million lines and thousands of unique ingredients, so R doesn't like it :( but thank you very much! – Siva Kg Jun 19 '21 at 21:43
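
For the scale mentioned in the last comment, the dense `table(data)` crosstab is likely the bottleneck. A sketch of the same `crossprod` idea on a sparse incidence matrix, using the Matrix package (and assuming each ingredient is listed at most once per product):

```r
library(Matrix)

# Build a sparse PRODUCT x INGREDIENT incidence matrix instead of a
# dense contingency table.
prod_f <- factor(data$PRODUCT)
ing_f  <- factor(data$INGREDIENT)
m <- sparseMatrix(i = as.integer(prod_f),
                  j = as.integer(ing_f),
                  x = 1,
                  dimnames = list(levels(prod_f), levels(ing_f)))

co <- crossprod(m)  # sparse INGREDIENT x INGREDIENT co-occurrence
co <- tril(co)      # keep one orientation of each pair (plus diagonal)

# Triplet form of the non-zero entries: one row per ingredient pair
trip <- summary(co)
out  <- data.frame(from   = colnames(co)[trip$i],
                   to     = colnames(co)[trip$j],
                   weight = trip$x)
```

Because the matrix stays sparse throughout, memory grows with the number of observed pairs rather than with (number of ingredients)², which should behave much better with thousands of unique ingredients.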