6

I have a table with two columns namely id and item:

df <- data.frame(id=c(1,1,2,2,2,2,3,3,3,4,4,4,4,4),item=c(1,2,3,1,2,3,4,1,2,3,1,2,1,2))

I want to find the most frequent combination (order doesn't matter) of 3 items per id. So basically, n choose r where n = number of items within id and r = 3. The number of items per id varies - some have more than 3, some have less.

I am new to R and read about combn and expand.grid, but I don't know how to use them in my case (to work within each id).

"Find most frequent combination of values in a data.frame" is the closest question I found.

EDIT: The expected answer based on the example is the combination "1, 2, 3", which appears in id 2 and 4.

Community
  • 1
  • 1
DGenchev
  • 327
  • 3
  • 12
  • 3
    Please add the expected result based on the example data to your question – talat Sep 28 '16 at 13:33
  • Would not `n choose r` return a number ? You can find the number of items per `id` using @Stephen answer with `group_by(id)`. Do you want to consider all the permutations of 3 numbers, and display the most frequent one out of `n choose r` possibilities ? – paulwasit Sep 28 '16 at 14:24

3 Answers3

3

Here is one idea using dplyr

df1 <- df %>% 
        group_by(id) %>% 
        arrange(item) %>% 
        summarise(new = paste(unique(combn(item, length(unique(item)), toString)), collapse = '/'))
df1
# A tibble: 4 × 2
#     id                                             new
#  <dbl>                                           <chr>
#1     1                                            1, 2
#2     2                     1, 2, 3 / 1, 3, 3 / 2, 3, 3
#3     3                                         1, 2, 4
#4     4 1, 1, 2 / 1, 1, 3 / 1, 2, 2 / 1, 2, 3 / 2, 2, 3

names(sort(table(unlist(strsplit(df1$new, '/'))), decreasing = TRUE)[1])
#[1] "1, 2, 3"
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • Thanks a lot. This is a great solution. I must be missing something, because I get the exact same `df1`, but when I run the last bit `names(...`, I get `[1] " 1, 1, 3 "` – DGenchev Sep 28 '16 at 15:55
0
library(dplyr)
grouped <- df %>% group_by(id,item) %>% summarize(count = n()) %>% arrange(desc(count))

Voila. The highest counts sorted from highest to lowest.

EDIT: Just realized I didn't fully answer your question. I hope I gave you a good start.

Stephen
  • 324
  • 2
  • 9
  • Thanks for the quick answer. I just added the expected result as requested. I am looking for the most frequent combination of items, not the most frequent item. – DGenchev Sep 28 '16 at 14:05
0

I think this is what you want using base R (no package needed):

a <- aggregate(item~id, df, unique)
a <- lapply(a$item, 'length<-', max(lengths(a$item)))
m <- matrix(unlist(a), ncol=3, byrow = T)
m <- t(apply(m,1,function(x) sort(x,na.last = T)))

#     [,1] [,2] [,3]
#[1,]    1    2   NA
#[2,]    1    2    3
#[3,]    1    2    4
#[4,]    1    2    3

Once we get matrix m, the most frequent row of the matrix is what you want:

t <- table(apply(m, 1, paste, collapse = "/")) 
as.numeric(strsplit(names(which.max(t)), "/")[[1]]) 

#[1] 1 2 3
989
  • 12,579
  • 5
  • 31
  • 53
  • Thanks for the answer, but I don't see how aggregate takes care of the different combinations possible within an id. It seems to take just the first 3 items in each. – DGenchev Sep 28 '16 at 15:52
  • @DGenchev updated my post. As it looks for unique elements per `id`, the order you see is by chance and if you change the order in the original `df`, you see it works as expected. – 989 Sep 28 '16 at 16:06
  • OK, I think I get it now. However, it will fall over if one group happens to have more items than all others (i.e., all other groups will have NAs and the one with more items will be put on top). I tested it by adding one more record with `id = 4` and `item = 5` – DGenchev Sep 28 '16 at 16:41