How to remove duplicate rows in both using a condition in R

Question

The data I have is something like that:

RES1 <- c("A","B","A","A","B")
RES2 <- c("B","A","A","B","A")
VAL1 <-c(3,5,3,6,8)
VAL2 <- c(5,3,7,2,7)
dff <- data.frame(RES1,VAL1,RES2,VAL2)
dff
  RES1 VAL1 RES2 VAL2
  1    A    3    B    5 
  2    B    5    A    3
  3    A    3    A    7
  4    A    6    B    2
  5    B    8    A    7

I want to remove the lines where I already have the same res1-res2 pair. For example: A 3 interacts with B 5. That's the information I want. I do not care which pair is first. B 5 with A 3 or A 3 with B 5. What I want to get is the following dataframe:

output
  RES1 VAL1 RES2 VAL2
   1    A    3    B    5
   2    A    3    A    7
   3    A    6    B    2
   4    B    8    A    7

Then I want to do the same for another data frame such as :

RES3 <- c("B","B","B","A","B")
RES4 <- c("A","A","A","A","B")
VAL4 <- c(3,7,5,3,8)
VAL3 <- c(5,8,3,7,3)
df2 <- data.frame(RES3,VAL3,RES4,VAL4)

df2
  RES3 VAL3 RES4 VAL4
   1     B     5     A     3
   2     B     8     A     7
   3     B     3     A     5
   4     A     7     A     3
   5     B     3     B     8

At the end, I just want to keep mutual pairs (in my definition both pairs are the same, keeping one is essential : "A 5" - "B 3" is the same as "B 3" - "A 5". In other words, order does not matter.

Final output I desire should have the following pairs which are unique and which exist in BOTH dataframes:

mutualpairs
  RESA VALA RESB VALB
  A     3     B     5
  A     3     A     7
  B     8     A     7

Related: http://stackoverflow.com/q/25297812 Maybe worth noting that `pmax`/`pmin` is much faster than rowwise `sort`. — Frank, Apr 22 '16 at 20:56
@digEmAll I've edited the question, still can use your help. — wthimdh, Apr 24 '16 at 22:06
Ok, moving from one data.frame to two is a pretty big change (enough to make the current answers no longer valid). Maybe you should instead post as a new question. If you want, though, I'll look for someone to reopen this question. (I can't do it myself.) Not sure if digEmAll got pinged. You should probably comment under his answer, not here. — Frank, Apr 24 '16 at 22:38
@Frank I think I'll open a new question and delete this one. Does that work? — wthimdh, Apr 24 '16 at 22:44
You won't be able to delete this one since it has an upvoted answer. (Nor should you, since it might be useful to others, in principle.) But opening another is fine, I think. My advice would be: Make the new post self-contained; link here but make sure it's understandable on its own. — Frank, Apr 24 '16 at 23:04
@Frank why is this a duplicate? the linked question is a simple case of two columns which will not work in OP's case — rawr, Apr 24 '16 at 23:40
@rawr Only because I preferred it over the duplicate that was marked previously. I'd be fine with it being open really, but can't vote to open it myself now. Note that the df2 part was added after the two answers below. The OP also has a new question with that addition. — Frank, Apr 25 '16 at 01:57

digEmAll · Accepted Answer · 2016-04-22T20:46:26.180

You can use this code:

dff[!duplicated(t(apply(cbind(paste(dff$RES1,dff$VAL1),paste(dff$RES2,dff$VAL2)),1,sort))),]

Equivalent unrolled code:

v1 <- paste(dff$RES1,dff$VAL1)
v2 <- paste(dff$RES2,dff$VAL2)
mx <- cbind(v1,v2)
mxSorted <- t(apply(mx,1,sort))
duped <- duplicated(mxSorted)
dff[!duped,]

Explanation:

1) we create two character vectors v1, v2 by concatenating columns RES1-VAL1 and RES2-VAL2 (note that paste uses a space as default separator, maybe you could use another character or string to be safer (e.g. |,@,; etc...)
Result:

> v1
[1] "A 3" "B 5" "A 3" "A 6" "B 8"
> v2
[1] "B 5" "A 3" "A 7" "B 2" "A 7"

2) we bind these two vectors to form a matrix using cbind;
Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "B 5" "A 3"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "B 8" "A 7"

3) we sort the values of each row of the matrix using t(apply(mx,1,sort));
by sorting the rows, we simply make identical the rows having the same values just swapped (note that final transpose is necessary since apply function always returns results on the columns).
Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "A 3" "B 5"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "A 7" "B 8"

4) calling duplicated on a matrix, we get a logical vector of length = nrow(matrix), being TRUE where a row is a duplicate of a previous row, so in our case, we get:

[1] FALSE  TRUE FALSE FALSE FALSE
# i.e. the second row is a duplicate

5) finally we use this vector to filter the rows of the data.frame, getting the final result:

  RES1 VAL1 RES2 VAL2
1    A    3    B    5
3    A    3    A    7
4    A    6    B    2
5    B    8    A    7

score 0 · Answer 2 · edited May 23 '17 at 11:45

0

Possible duplicate of Remove duplicates column combinations from a dataframe in R

Adapting the answer here:

dff[!duplicated(dff[c('RES1','RES2')]),]

edited May 23 '17 at 11:45

Community

1
1

answered Apr 22 '16 at 19:41

shrgm

1,315
1
10
20

It's almost what I want. But outcome of this line is lines 1, 3, and 4 of my output dataframe. However, I also want second line. Condition 'RES1', 'VAL1' is not what I want. I want to eliminate the ones where RES1 and RES2 alternates such as line 1 and 2 in dff dataframe. – wthimdh Apr 22 '16 at 19:46
The data you have provided for `dff` seems wrong. However, I realized that you wanted to eliminate duplicates based on RES1, RES2 and not RES1, VAL1, and edited the answer accordingly. – shrgm Apr 22 '16 at 19:49
Nope. I want to keep one original pair of RES and VAL. For example if I have A 3 B 5, I do not want to keep B 5 A 3 since it;s just swapped version of the other one. A 3 interacts with B 5. That's the information I want. I dont care which pair is first. B 5 with A 3 or A 3 with B 5. So res1,res2 do not give me the correct answer either. – wthimdh Apr 22 '16 at 19:54
@wthimdh so maybe you need something like `dff[!duplicated(t(apply(dff, 1, sort))), ]` – rawr Apr 22 '16 at 19:58
@rawr would you mind what this line does? Because my actual data is way complicated than the example I put here. – wthimdh Apr 22 '16 at 20:03
@rawr: I think this is more correct: `dff[!duplicated(t(apply(rbind(paste(dff$RES1,dff$VAL1),paste(dff$RES2,dff$VAL2)),2,sort))), ]` – digEmAll Apr 22 '16 at 20:03
it gives me an error:( – wthimdh Apr 22 '16 at 20:07
1

@wthimdh: do you mean my code? it works, just pay attention when you copy/paste it since it spans multiple lines so it copies also some strange characters ;) – digEmAll Apr 22 '16 at 20:13
@digEmAll would you mind explaining this magical line WHICH WORKS JUST PERFECT? – wthimdh Apr 22 '16 at 20:16
@digEmAll yup. `dff[!duplicated(t(apply(sapply(1:2, function(x) interaction(dff[, 2 * x + -1:0])), 1, sort))), ]` might be a little more automated – rawr Apr 22 '16 at 20:25
@wthimdh: added an answer – digEmAll Apr 22 '16 at 20:38
@rawr: yes, but maybe the real case is not exaclty in a compatible format. Also, IIRC interaction creates all the combinations (also non-existing) unless you say drop=T, so we must be careful ;) – digEmAll Apr 22 '16 at 20:40
@digEmAll yes we can come up with many ways to break any particular piece of code. and creating all combinations (or not) is not relevant here – rawr Apr 22 '16 at 20:53
@rawr: yep, I was wrong about interaction it just creates more levels in the factors but it's not a problem here... (also interaction uses paste internally, so is very similar to my approach) – digEmAll Apr 22 '16 at 20:59
@digEmAll paste is better. I used interaction since `interaction(dff)` works whereas one needs `do.call(paste, dff)`, ie, `sapply(1:2, function(x) do.call(paste, dff[, 2 * x + -1:0]))` instead of the sapply above – rawr Apr 22 '16 at 21:02

How to remove duplicate rows in both using a condition in R

2 Answers2

Linked