1

I have 2 file

file 1-

  colIDs rowIDs
    M1  M2
    M1  M3
    M3  M1
    M3  M2
    M4  M5
    M7  M6

file 2-

   Pcol       Mcol
    P1      M1,M2,M5,M6
    P2      M1,M2,M3,M5
    P3      M4,M5,M7,M6

I want to count the frequency of the file 1 pair of column 1 and column 2 in file 2 Mcol.

Expected output-

colIDs     rowIDs   freq
M1           M2      2
M1           M3      1
M3           M1      1
M3           M2      1
M4           M5      1
M7           M6      1
xrxrxrxxr
  • 89
  • 1
  • 7
  • 2
    Welcome to stackoverflow! Please read and edit your question according to [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that other users can help you. – pogibas Jun 12 '19 at 09:37

2 Answers2

1

Get all combinations for every row of "Mcol", then rowbind and aggregate:

# example data
x <- read.table(text = "Pcol       Mcol
    P1      M1,M2,M5,M6
    P2      M1,M2,M3,M5
    P3      M4,M5,M7,M6", header = TRUE, stringsAsFactors = FALSE)

# split on ",", get all unique combinations
xx <- do.call(rbind.data.frame, 
              lapply(x$Mcol, function(i){
                n <- sort(unlist(strsplit(i, ",")))
                t(combn(n, 2))
              }))

# get count of all pairs
data.frame(table(paste(xx[, 1], xx[, 2], sep = ",")))
#     Var1 Freq
# 1  M1,M2    2
# 2  M1,M3    1
# 3  M1,M5    2
# 4  M1,M6    1
# 5  M2,M3    1
# 6  M2,M5    2
# 7  M2,M6    1
# 8  M3,M5    1
# 9  M4,M5    1
# 10 M4,M6    1
# 11 M4,M7    1
# 12 M5,M6    2
# 13 M5,M7    1
# 14 M6,M7    1
zx8754
  • 52,746
  • 12
  • 114
  • 209
0

It is a very big dplyr chain so I may have complicated certain steps but it gives the expected result.

library(dplyr)
library(tidyr)

df1 %>%
   mutate(c1 = pmin(colIDs, rowIDs), 
          c2 = pmax(colIDs, rowIDs)) %>%
   unite(newcol, c1, c2) %>%
   left_join(df2 %>%
              separate_rows(Mcol) %>%
              group_by(Pcol) %>%
              summarise(new_col = list(combn(Mcol, 2, paste0, collapse = ","))) %>%
              unnest() %>%
              separate(new_col, c("col1", "col2")) %>%
              count(col1, col2) %>%
              mutate(colIDs = pmin(col1, col2), rowIDs = pmax(col1, col2)) %>%
              unite(newcol, colIDs, rowIDs) %>% 
              select(newcol, n), by = c("newcol" = "newcol")) %>%
    select(-newcol)

#  colIDs rowIDs n
#1     M1     M2 2
#2     M1     M3 1
#3     M3     M1 1
#4     M3     M2 1
#5     M4     M5 1
#6     M7     M6 1

For df1 we sort columns colIDs and rowIDs and unite them in a new column called new_col which is later used as a key to merge with df2. In df2 we first separate the comma separated entries into different rows using separate_rows, group_by Pcol and create a list of pair of combination of Mcol values. Use separate to bring them into two different columns , count occurrences of their combination and again sort them using pmin and pmax and finally join them with df1.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213