1

The data I have is something like that:

RES1 <- c("A","B","A","A","B")
RES2 <- c("B","A","A","B","A")
VAL1 <-c(3,5,3,6,8)
VAL2 <- c(5,3,7,2,7)
dff <- data.frame(RES1,VAL1,RES2,VAL2)
dff
  RES1 VAL1 RES2 VAL2
  1    A    3    B    5 
  2    B    5    A    3
  3    A    3    A    7
  4    A    6    B    2
  5    B    8    A    7

I want to remove the lines where I already have the same res1-res2 pair. For example: A 3 interacts with B 5. That's the information I want. I do not care which pair is first. B 5 with A 3 or A 3 with B 5. What I want to get is the following dataframe:

output
  RES1 VAL1 RES2 VAL2
   1    A    3    B    5
   2    A    3    A    7
   3    A    6    B    2
   4    B    8    A    7

Then I want to do the same for another data frame such as :

RES3 <- c("B","B","B","A","B")
RES4 <- c("A","A","A","A","B")
VAL4 <- c(3,7,5,3,8)
VAL3 <- c(5,8,3,7,3)
df2 <- data.frame(RES3,VAL3,RES4,VAL4)

df2
  RES3 VAL3 RES4 VAL4
   1     B     5     A     3
   2     B     8     A     7
   3     B     3     A     5
   4     A     7     A     3
   5     B     3     B     8

At the end, I just want to keep mutual pairs (in my definition both pairs are the same, keeping one is essential : "A 5" - "B 3" is the same as "B 3" - "A 5". In other words, order does not matter.

Final output I desire should have the following pairs which are unique and which exist in BOTH dataframes:

mutualpairs
  RESA VALA RESB VALB
  A     3     B     5
  A     3     A     7
  B     8     A     7
wthimdh
  • 476
  • 1
  • 5
  • 19

2 Answers2

2

Using the logic in my earlier comment, you could try this

df2 <- structure(list(RES3 = c("B", "B", "B", "A", "B"), VAL3 = c(5, 8, 3, 7, 3), RES4 = c("A", "A", "A", "A", "B"), VAL4 = c(3, 7, 5, 3, 8)), .Names = c("RES3", "VAL3", "RES4", "VAL4"), row.names = c(NA, -5L), class = "data.frame")
dff <- structure(list(RES1 = c("A", "B", "A", "A", "B"), VAL1 = c(3, 5, 3, 6, 8), RES2 = c("B", "A", "A", "B", "A"), VAL2 = c(5, 3, 7, 2, 7)), .Names = c("RES1", "VAL1", "RES2", "VAL2"), row.names = c(NA, -5L), class = "data.frame")

Write a function that splits a data frame into pairs of columns and sorts the interaction pairs

f <- function(data) {
  idx <- t(apply(sapply(seq(ncol(data) / 2), function(x)
    interaction(data[, 2 * x + -1:0], sep = ' ')), 1, sort))
  apply(idx, 1, paste, collapse = ' ')
}

Eg,

f(dff)
# [1] "A 3 B 5" "A 3 B 5" "A 3 A 7" "A 6 B 2" "A 7 B 8"

Use it on both data frames and find the common ones; then you could use this index to 1) subset the original data frames or 2) since this already contains the information you want, just undo what f does, eg

dff$idx <- f(dff)
df2$idx <- f(df2)

idx <- intersect(dff$idx, df2$idx)

read.table(text = idx, col.names = c('RESA','VALA','RESB','VALB'))
#   RESA VALA RESB VALB
# 1    A    3    B    5
# 2    A    3    A    7
# 3    A    7    B    8
Community
  • 1
  • 1
rawr
  • 20,481
  • 4
  • 44
  • 78
  • what if I have more columns in the data frames? – wthimdh Apr 25 '16 at 01:54
  • @wthimdh do you only have two pairs of res/val? use `f(data[, c('res1','val1','res2','val2')])` or other subsetting methods. `f` can take any number of pairs of res/val, but it does assume that the data frame used is made up of only pairs of res/val – rawr Apr 25 '16 at 02:09
  • oh, unfortunately I have other columns as well :s but i need to use only wo pairs of res/val @rawr – wthimdh Apr 25 '16 at 02:28
  • @wthimdh that's why I suggested `f(data[, c('res1','val1','res2','val2')])` or any number of res/val pairs will work. `f(data[, grep('res|val', names(data))])` any subsetting method will work – rawr Apr 25 '16 at 06:16
2

Here is one solution:

library(dplyr)
df1$combined <- apply(df1, 1, function(x) paste(sort(c(paste(x[1], x[2], collapse = ','), paste(x[3], x[4], collapse = ','))), collapse = ','))
df2$combined <- apply(df2, 1, function(x) paste(sort(c(paste(x[1], x[2], collapse = ','), paste(x[3], x[4], collapse = ','))), collapse = ','))

df <- inner_join(df1 %>% group_by(combined) %>% slice(1), df2 %>% group_by(combined) %>% slice(1))
df <- df %>% ungroup() %>% select(RES1, VAL1, RES2, VAL2)

Giving you this output:

Source: local data frame [3 x 4]

    RES1  VAL1   RES2  VAL2
  (fctr) (dbl) (fctr) (dbl)
1      A     3      A     7
2      A     3      B     5
3      B     8      A     7

This approach does only one pass through the data in the above apply functions and relies on highly efficient dplyr group_by and inner_join.

The main task is to get the pairs of columns in a certain order so they can be then grouped by it. The first two lines are simply getting two pairs of columns 'appended' into a string that is ordered so values appearing in either pair count as same. Then, the group_by() is grouping all rows with same combined column and taking the first row (using slice). Joining the resulting two data frames using inner_join ensures only rows that are common between the two data frames are retained. The last line is simply selecting desired columns to keep. BTW - with respect to the message above, you may want to specify by argument in the inner_join call if you only want to join by combined column and nothing else.

Gopala
  • 10,363
  • 7
  • 45
  • 77
  • Does your answer also assume data frames do not have any other columns? Because I tried and it gave me error (on my real data) – wthimdh Apr 25 '16 at 02:59
  • In my code `x[1], x[2]` etc. are assuming certain column positions based on your input. You may need to adjust accordingly. Perhaps you can post the error you are seeing? – Gopala Apr 25 '16 at 03:01
  • sorry it was my typo. It just gives a warning like Joining by: c("value", "resi", "AAi", "chaini", "secstri", "resj", "AAj", "chainj", "secstrj", "numerVal", "denomVal", "combined") But it works! Thanks a lot. Would you mind explaining it? Is it creating a column including "RES1 VAL1, RES2 VAL2" and then choosing unique and common pairs? – wthimdh Apr 25 '16 at 03:14
  • I added to the answer. Hope that helps. You can execute each line of code and see what is going on. Including the individual portions within the `inner_join` call. – Gopala Apr 25 '16 at 03:22
  • would you mind giving an example of how to find unique ""RES1 VAL1, RES2 VAL2"pairs in one data frame? without comparing with another data frame? So that I understand every line and modify them when necessary.. Thanks in adnvance – wthimdh Apr 25 '16 at 04:56
  • If you want to find unique rows by those four columns, `distinct` in `dplyr` package will do the trick. Just specify those four rows and the data frame. However, in your case ,you wanted the combination of RES/VAL1 and RES2/VAL2 pairs (in any order) to be counted as unique. That makes it a bit more complicated (why I had to use the `apply` to do row wise combining and sorting). – Gopala Apr 25 '16 at 14:06