Count frequency of elements of one file in another file using R

Question

I have 2 file

file 1-

  colIDs rowIDs
    M1  M2
    M1  M3
    M3  M1
    M3  M2
    M4  M5
    M7  M6

file 2-

   Pcol       Mcol
    P1      M1,M2,M5,M6
    P2      M1,M2,M3,M5
    P3      M4,M5,M7,M6

I want to count the frequency of the file 1 pair of column 1 and column 2 in file 2 Mcol.

Expected output-

colIDs     rowIDs   freq
M1           M2      2
M1           M3      1
M3           M1      1
M3           M2      1
M4           M5      1
M7           M6      1

Welcome to stackoverflow! Please read and edit your question according to [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that other users can help you. — pogibas, Jun 12 '19 at 09:37

score 1 · Answer 1 · answered Jun 12 '19 at 10:39

Get all combinations for every row of "Mcol", then rowbind and aggregate:

# example data
x <- read.table(text = "Pcol       Mcol
    P1      M1,M2,M5,M6
    P2      M1,M2,M3,M5
    P3      M4,M5,M7,M6", header = TRUE, stringsAsFactors = FALSE)

# split on ",", get all unique combinations
xx <- do.call(rbind.data.frame, 
              lapply(x$Mcol, function(i){
                n <- sort(unlist(strsplit(i, ",")))
                t(combn(n, 2))
              }))

# get count of all pairs
data.frame(table(paste(xx[, 1], xx[, 2], sep = ",")))
#     Var1 Freq
# 1  M1,M2    2
# 2  M1,M3    1
# 3  M1,M5    2
# 4  M1,M6    1
# 5  M2,M3    1
# 6  M2,M5    2
# 7  M2,M6    1
# 8  M3,M5    1
# 9  M4,M5    1
# 10 M4,M6    1
# 11 M4,M7    1
# 12 M5,M6    2
# 13 M5,M7    1
# 14 M6,M7    1

Ronak Shah · Answer 2 · 2019-06-12T10:38:38.223

It is a very big dplyr chain so I may have complicated certain steps but it gives the expected result.

library(dplyr)
library(tidyr)

df1 %>%
   mutate(c1 = pmin(colIDs, rowIDs), 
          c2 = pmax(colIDs, rowIDs)) %>%
   unite(newcol, c1, c2) %>%
   left_join(df2 %>%
              separate_rows(Mcol) %>%
              group_by(Pcol) %>%
              summarise(new_col = list(combn(Mcol, 2, paste0, collapse = ","))) %>%
              unnest() %>%
              separate(new_col, c("col1", "col2")) %>%
              count(col1, col2) %>%
              mutate(colIDs = pmin(col1, col2), rowIDs = pmax(col1, col2)) %>%
              unite(newcol, colIDs, rowIDs) %>% 
              select(newcol, n), by = c("newcol" = "newcol")) %>%
    select(-newcol)

#  colIDs rowIDs n
#1     M1     M2 2
#2     M1     M3 1
#3     M3     M1 1
#4     M3     M2 1
#5     M4     M5 1
#6     M7     M6 1

For df1 we sort columns colIDs and rowIDs and unite them in a new column called new_col which is later used as a key to merge with df2. In df2 we first separate the comma separated entries into different rows using separate_rows, group_by Pcol and create a list of pair of combination of Mcol values. Use separate to bring them into two different columns , count occurrences of their combination and again sort them using pmin and pmax and finally join them with df1.

Count frequency of elements of one file in another file using R

2 Answers2