0

I have a data frame with 2 columns, ID and a category name:

     X1     X2
    1234   Metal
    1234   Metal
    1234   Plastic
    1234   Plastic
    1234   Glass
    1235   Metal
    1235   Metal
    1235   Plastic
    1235   Plastic
    1235   Glass
    1236   Glass
    1236   Glass
    1236   Metal
    1236   Metal
    1236   Plastic

I want to find the most frequent combinations and the count of those combinations of 2 (I will want combinations of 3 or 4 for a larger dataset) across the entire dataset:

    Metal, Plastic     2
    Glass, Metal       1

I tried to first generate all possible combinations of X2 by ID (X1), so I could then use dplyr to aggregate and subset the top combinations. Unfortunately, my dataset is too large for this to run efficiently. Any ideas on an easier and faster way to figure this out?

anrpet
  • 139
  • 1
  • 7
  • Have you used `filter(n_distinct())`? – papelr Jul 18 '18 at 19:51
  • 1
    OP: Your question says that you have a solution that's not efficient. Please let us see, as that will be needed to benchmark and because a [reproducible](https://stackoverflow.com/q/5963269/1422451) example is required per the [MCVE](https://stackoverflow.com/help/mcve) and [`r`](https://stackoverflow.com/tags/r/info) tag description. You can use `dput()`, `reprex::reprex()` or built-in data sets for reproducible data rather than just showing it for manual transcription by each potential answerer. – Hack-R Jul 18 '18 at 20:01
  • Can a category have a 'combination' with itself? E.g., ID `1236` has 3 `Glass` values - would `Glass, Glass` be considered a combination? – zack Jul 18 '18 at 20:26

2 Answers2

0

Here's an attempt at what I think you're getting at. You can change the top_n argument and I let categories combine with themselves, but you can just add an additional filter if that shouldn't be the case.

library(dplyr)

df %>% 
  mutate(ID = row_number()) %>%
  inner_join(., ., by = c('X1' = 'X1')) %>%
  filter(ID.x != ID.y) %>% # shouldn't count as combo with itself
  group_by(X2.x, X2.y) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  top_n(5, n) %>%
  arrange(desc(n))

# A tibble: 7 x 3
  X2.x    X2.y        n
  <chr>   <chr>   <int>
1 Metal   Plastic    10
2 Plastic Metal      10
3 Glass   Metal       8
4 Metal   Glass       8
5 Glass   Plastic     6
6 Metal   Metal       6
7 Plastic Glass       6

# Tie results in more than 5 rows for top_n()

Data

df <- data.table::fread("X1     X2
1234   Metal
1234   Metal
1234   Plastic
1234   Plastic
1234   Glass
1235   Metal
1235   Metal
1235   Plastic
1235   Plastic
1235   Glass
1236   Glass
1236   Glass
1236   Metal
1236   Metal
1236   Plastic")
zack
  • 5,205
  • 1
  • 19
  • 25
0

Input

df
#      X1      X2
# 1  1234   Metal
# 2  1234   Metal
# 3  1234 Plastic
# 4  1234 Plastic
# 5  1234   Glass
# 6  1235   Metal
# 7  1235   Metal
# 8  1235 Plastic
# 9  1235 Plastic
# 10 1235   Glass
# 11 1236   Glass
# 12 1236   Glass
# 13 1236   Metal
# 14 1236   Metal
# 15 1236 Plastic

For each unique X1 variable, the count of X2 elements

 result <- table(cbind.data.frame(df$X1, df$X2))
 result
 #       df$X2
 # df$X1  Glass Metal Plastic
 #   1234     1     2       2
 #   1235     1     2       2
 #   1236     2     2       1

Printing the top two most frequent elements of X2 for each unique X1

 final <- apply(result,1, function(x) names(which(x == max(x))))
 final
 #  df$X1
 #    1234      1235      1236   
 # [1,] "Metal"   "Metal"   "Glass"
 # [2,] "Plastic" "Plastic" "Metal"
Sowmya S. Manian
  • 3,723
  • 3
  • 18
  • 30