
Suppose you have a dataframe with ids and the elements assigned to each id. For example:

example <- data.frame(id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
                      vals = c("a","b",'c','d','e','a','b','d','c',
                                 'd','f','g','h','a','k','l','m', 'a',
                                 'b', 'c'))

I want to find all possible pair combinations. My main struggle is not which R functions to use but the logic: how can I iterate through all elements and find patterns? For instance, a was picked together with b 3 times in my sample dataframe. The original dataframe has more than 30k rows, so I cannot count these combinations manually. How do I automate counting how often each pair of elements is picked together?

I was thinking about widening my df with pivot_wider and then using map_lgl to find matches. The problem is that applying map_lgl to every pair of elements would take a lot of time.

I asked nearly the same question less than a month ago; fellow users answered it, but the result is not what I need.

Do you have any ideas on how to create a dataframe with all possible combinations of values for all ids?

rg4s
  • Can you show us your expected output? – user2974951 Mar 04 '21 at 08:44
  • Which pair combinations do you need, `id` and `vals`, or between `vals`? And I'm not sure what you mean by "`a` was picked with `b` 3 times." – yh6 Mar 04 '21 at 09:02
  • @yh6 between vals, yes. – rg4s Mar 04 '21 at 09:30
  • @k1rgas So for example, for `id`=1 we have a, b, c, d, e as `vals`, then we create all possible pair combinations between these 5 characters. And we iterate this procedure over all ids, and finally count the number of the patterns of pairs. Is this correct? – yh6 Mar 04 '21 at 09:51
  • @yh6 something like that, but we want to count the number of patterns for all ids, not only for id1, id2, and etc. Like, ```id == 1``` has chosen ```a``` with ```b```, ```c```, ```d```. How many other ids have picked the same options? – rg4s Mar 04 '21 at 10:01
  • @k1rgas Yes, that's what I meant by "count the number of the patterns of pairs." For example, the pair (a, b) appears at id 1, 2, and 5, so you want to get "3" for this pair, don't you? – yh6 Mar 04 '21 at 10:05
  • @yh6 yes, you got it right. I was just clarifying for sure – rg4s Mar 04 '21 at 10:06
  • I saw that `arules` was suggested in [a comment on your previous post](https://stackoverflow.com/questions/66167864/how-to-count-each-column-values-frequency-combinations-in-r#comment116981754_66167864). In the post I linked to, you'll find that the `arules` code is rather straightforward and that it was fast (at least compared to the other answers provided there). Good luck! Cheers – Henrik Mar 04 '21 at 11:11
  • Also, for pairs: `m = crossprod(table(example))`; `m[lower.tri(m, diag = TRUE)] = NA`; `na.omit(data.frame(as.table(m)))`; [Intersect all possible combinations of list elements](https://stackoverflow.com/questions/24614391/intersect-all-possible-combinations-of-list-elements) – Henrik Mar 05 '21 at 07:37
  • @Henrik yeah, thanks, but ```apriori``` does not work with my original dataset with more than 9k rows. R just ran out of memory. – rg4s Mar 06 '21 at 08:25
  • OK, then there must be some special features of your data / desired analysis. As you see in the link, I ran "10000 customers with up to 10 products each" in 60 milliseconds...Anyway: good luck! – Henrik Mar 06 '21 at 10:33
  • @Henrik yes, I see that your code is very fast. But mine can't allocate a vector of a big size (64 Mb). And it happens when I try to convert the result of the ```apriori``` function to a dataframe. Do you have any ideas how to manage it? – rg4s Mar 09 '21 at 09:58
  • In my example, I had set the value for the minimal support of an item set to 0: `support = 0`, and then removed itemsets of zero count _after_ I had coerced to data frame. If you set the `support` to, say, the default 0.1, you will exclude all itemsets with `0` support already in the `apriori` step. This will of course decrease the size of the results. You may try it. See also the nice [vignette](https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf). Good luck! – Henrik Mar 09 '21 at 15:07
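
For reference, the `crossprod(table(example))` idea from the comments can be unpacked step by step. This is only a sketch in base R, assuming each id lists a given value at most once (otherwise the table entries exceed 1 and the products no longer count ids). The names `tab`, `m`, `idx`, and `pair_counts` are illustrative, and the last step uses `which(..., arr.ind = TRUE)` instead of the `as.table()` route shown in the comment.

```r
# Same example data as in the question
example <- data.frame(
  id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
  vals = c("a","b","c","d","e","a","b","d","c","d",
           "f","g","h","a","k","l","m","a","b","c")
)

# id x vals incidence table: 1 if the id picked the value, 0 otherwise
# (assumes no value is repeated within an id)
tab <- table(example$id, example$vals)

# t(tab) %*% tab: entry [i, j] is the number of ids that picked both i and j
m <- crossprod(tab)

# Keep each unordered pair once (upper triangle) and drop pairs never seen
idx <- which(upper.tri(m) & m > 0, arr.ind = TRUE)

pair_counts <- data.frame(
  val1 = rownames(m)[idx[, "row"]],
  val2 = colnames(m)[idx[, "col"]],
  n    = m[idx]
)

# Most frequent pairs first
pair_counts <- pair_counts[order(-pair_counts$n), ]
head(pair_counts)
```

The cross-product matrix has one row and one column per distinct value, so this works best when the number of distinct values is moderate, regardless of how many ids there are.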

2 Answers

This code is admittedly slow, but here is another way to get the expected output, based on the tidyverse packages.
I first nest the dataframe by id, then produce all pair combinations for each id, unnest the result, and finally count the pairs.

library(tidyverse)
example <- data.frame(
  id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
  vals = c("a","b",'c','d','e','a','b','d','c','d','f','g','h','a','k','l','m','a','b', 'c')
)
example %>%
  nest(dataset = -id) %>%
  mutate(dataset = map(dataset, function(dataset) {
    if (nrow(dataset) > 1) {
      # sort first so that (a, d) and (d, a) are counted as the same pair
      dataset$vals %>%
        sort() %>%
        combn(2) %>%
        t() %>%
        as_tibble(.name_repair = ~c("val1", "val2"))
    } else {
      # an id with a single value has no pairs
      NULL
    }
  })) %>%
  unnest(cols = dataset) %>%
  group_by(val1, val2) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(n), val1, val2)
#> # A tibble: 33 x 3
#>    val1  val2      n
#>    <chr> <chr> <int>
#>  1 a     b         3
#>  2 a     d         3
#>  3 a     c         2
#>  4 b     c         2
#>  5 b     d         2
#>  6 a     e         1
#>  7 a     f         1
#>  8 a     g         1
#>  9 a     h         1
#> 10 a     k         1
#> # … with 23 more rows

Created on 2021-03-04 by the reprex package (v1.0.0)

yh6

This won't (can't) be fast for many IDs. If it is too slow, you need to parallelize or implement it in a compiled language (e.g., using Rcpp).

We sort vals first, so that each pair always appears in the same order. We can then create all combinations of two items, grouped by ID. We exclude IDs with only 1 item. Finally, we tabulate the result.

library(data.table)
setDT(example)
setorder(example, id, vals)
example[, if (.N > 1) split(combn(vals, 2), 1:2), by = id][, .N, by = c("1", "2")]
#    1 2 N
# 1: a b 3
# 2: a c 2
# 3: a d 3
# 4: a e 1
# 5: b c 2
# 6: b d 2
# 7: b e 1
#<...>
Roland
  • Seems like your advice is valid. But I really do not understand how this code is working. Could you, please, explain what we do in the first and in the second square brackets? Like, if an id has picked more than one option, we split all options pairwise and see their combinations? – rg4s Mar 04 '21 at 10:02
  • I believe my answer explains what the code does. `combn` returns a matrix (in this case with two rows). I split the rows so that `[.data.table` gets passed a list, which it automatically treats as columns. You may need to study at least the introductory vignette of package data.table. `.N` is the number of rows in each group as defined in `by`. – Roland Mar 04 '21 at 10:08
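
To make the `combn` and `split` step from the last comment concrete, here is a small base R sketch of what happens inside one group. It uses the sorted values of id 2 from the example; the variable names `pairs` and `cols` are illustrative only.

```r
vals <- c("a", "b", "d")   # the sorted vals of id 2 in the example

# combn() returns a matrix with one pair per column (here 2 rows x 3 columns)
pairs <- combn(vals, 2)
pairs
#      [,1] [,2] [,3]
# [1,] "a"  "a"  "b"
# [2,] "b"  "d"  "d"

# A matrix is stored column by column, so recycling the grouping vector 1:2
# sends every first-row element to "1" and every second-row element to "2"
cols <- split(pairs, 1:2)
cols
# $`1`
# [1] "a" "a" "b"
#
# $`2`
# [1] "b" "d" "d"
```

Because `cols` is a list of two equal-length vectors, data.table treats it as two columns named `1` and `2`, which is why the second pair of square brackets can group by `c("1", "2")`.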