
My original dataset has 3k+ rows and 2 columns: an id and a language that the person behind that id can apply in practice. My first step was to find how often each combination of languages was chosen. For example, how many times Python was chosen along with R and SQL, or how many times Java was picked with JavaScript and C++, and so on.

Some research on Stack Overflow helped me find these patterns. Here is some code with a sample data set:

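library(dplyr)  # provides %>%, group_by(), arrange(), summarise(), count()

# Four random ids, each repeated for four language picks
sample <- data.frame(id = rep(randomNames::randomNames(4), each = 4),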
                     programming = c("R", "Python", "C#", "Other",
                                     "R", "Tableu", "Assembler",
                                     "Other", "Java", "JavaScript",
                                     "Python", "C#","R", "Python", "C#",
                                     "Other"))
gr <- sample %>%
  group_by(id) %>%
  arrange(programming) %>%
  summarise(programming = paste(sort(unique(programming)), collapse = ", ")) %>%
  count(programming)

But now I wonder how I can find the most frequent picks for each language. For instance, R was picked together with Java and Kotlin very few times, so that is not a popular combination, whereas R picked together with Python and SQL is more popular. My goal is to find which languages are picked together most frequently.

I also did some research (example) but, unfortunately, have not found a solution.

I think I should iterate over my programming column to find all possible picks (R + ..., Python + ...; then R + Python + ...). I tried using lapply but struggled with writing the lambda function.

What are the possible ways to solve this issue? Is there an effective function for this purpose?

rg4s
  • You could read up on the `arules` package (from its description: "Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules)."). – dario Feb 12 '21 at 07:52
  • @dario thank you! I'll do some research on this package, I have not heard of it. – rg4s Feb 12 '21 at 08:17

1 Answer


One option would be to create all pairwise combinations of languages within each id and count which pairs occur together most frequently.

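library(dplyr)

sample %>%
  group_by(id) %>%
  # Build every sorted pair of languages within an id, e.g. "C#-Python".
  # Note: with dplyr >= 1.1.0, reframe() is the recommended verb when a
  # group returns more than one row; summarise() warns in that case.
  summarise(programming = combn(sort(programming), 2, 
                                paste0, collapse = '-'), .groups = 'drop') %>%
  # Count how many ids contain each pair, most frequent first.
  count(programming, sort = TRUE)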

#   programming           n
#   <chr>             <int>
# 1 C#-Python             3
# 2 Other-R               3
# 3 C#-Other              2
# 4 C#-R                  2
# 5 Other-Python          2
# 6 Python-R              2
# 7 Assembler-Other       1
# 8 Assembler-R           1
# 9 Assembler-Tableu      1
#10 C#-Java               1
#11 C#-JavaScript         1
#12 Java-JavaScript       1
#13 Java-Python           1
#14 JavaScript-Python     1
#15 Other-Tableu          1
#16 R-Tableu              1
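
For the "most frequent pick for each language" part of the question, one possible extension is a co-occurrence matrix built with base R. The following is only a minimal sketch under some assumptions: the ids `id1` to `id4` are placeholders standing in for `randomNames::randomNames(4)`, and the names `incidence`, `cooc` and `top_partner` are made up for illustration.

```r
# Stand-in for the question's sample data; the ids are placeholders
# (assumption) replacing randomNames::randomNames(4).
sample <- data.frame(
  id = rep(paste0("id", 1:4), each = 4),
  programming = c("R", "Python", "C#", "Other",
                  "R", "Tableu", "Assembler", "Other",
                  "Java", "JavaScript", "Python", "C#",
                  "R", "Python", "C#", "Other")
)

# Incidence table: one row per id, one column per language, cells are 0/1.
# unique() guards against an id listing the same language twice.
incidence <- table(unique(sample))

# Co-occurrence matrix: cell [a, b] is the number of ids that picked
# both language a and language b.
cooc <- crossprod(incidence)
diag(cooc) <- 0  # a language paired with itself is not informative

# For each language, its most frequent partner and how many ids share the pair.
# which.max() returns the first maximum, so ties go to the first column.
top_partner <- data.frame(
  language = rownames(cooc),
  partner  = colnames(cooc)[apply(cooc, 1, which.max)],
  n        = apply(cooc, 1, max),
  row.names = NULL
)
top_partner
```

On the sample data this reports, for instance, Python as the top partner of C# (3 ids) and Other as the top partner of R (3 ids), which agrees with the pair counts above. Sorting a single row of `cooc`, e.g. `sort(cooc["R", ], decreasing = TRUE)`, gives the full ranking of partners for one language.
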
Ronak Shah