How to remove duplicates based on the combinations of two columns

Question

I have a dataframe like the one below.

df = expand.grid(A = c('a', 'b', 'c', 'd'),
                B = c('a', 'b', 'c', 'd'))


A   B
a   a           
b   a           
c   a           
d   a           
a   b           
b   b           
c   b           
d   b           
a   c           
b   c

What I need to do is remove duplicates based on the COMBINATION of the two column values. For example, if row 1 is 'a', 'b' and row 2 is 'b', 'a', they are considered duplicates and I need to remove one of them. Removing duplicates across two columns is easy, but how can I remove duplicates based on their combinations, regardless of order? I could not figure out how. Thanks a lot in advance.

BENY · Answer 1 · 2018-05-01T14:44:26.253

You can use duplicated() after sorting the values within each row with apply():

df[!duplicated(data.frame(t(apply(df,1,sort)))),]
   A B
1  a a
2  b a
3  c a
4  d a
6  b b
7  c b
8  d b
11 c c
12 d c
16 d d
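
To see why the one-liner works, it can help to split it into steps. The following is only a sketch: the intermediate names (sorted_pairs, is_dup, result, result2) are made up for illustration, and the pmin()/pmax() variant at the end is an alternative cross-check, not part of the original answer. For the 4 x 4 grid in the question, both versions should leave 10 rows (4 pairs with equal values plus 6 unordered pairs).

```r
# Rebuild the data frame from the question (16 rows, every ordered pair).
df <- expand.grid(A = c('a', 'b', 'c', 'd'),
                  B = c('a', 'b', 'c', 'd'))

# Step 1: sort the values within each row. apply() with MARGIN = 1 walks the
# rows and returns one column per input row, hence the t() to flip it back.
sorted_pairs <- t(apply(df, 1, sort))

# Step 2: flag rows whose sorted pair has already appeared further up.
# duplicated() has a matrix method that compares whole rows by default.
is_dup <- duplicated(sorted_pairs)

# Step 3: index the ORIGINAL data frame with the negated flags, so the
# kept rows still show their original (unsorted) values and row names.
result <- df[!is_dup, ]

# Cross-check (assumption: not from the answer): build the order-independent
# key with vectorised pmin()/pmax() on character copies of the columns.
a <- as.character(df$A)
b <- as.character(df$B)
result2 <- df[!duplicated(data.frame(lo = pmin(a, b), hi = pmax(a, b))), ]
```

The key point is that the sorted copy is only used to compute the logical flags; the filtering itself is done on the untouched df, which is why row order and row names are preserved.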

edited May 01 '18 at 14:44

answered May 01 '18 at 13:42

BENY

This works. When I looked at the result of lapply(df, sort), it is a list of two vectors (columns), both sorted, which essentially become the same. I still do not understand why it works. Could you explain a little further? Thanks a lot! – zesla May 01 '18 at 14:19
@zesla lapply will sort row by row and create the vector again. Because the sort key is the same, a,b and b,a both end up as a,b; then we just drop duplicates and get the unique pairs. (Notice lapply will not change the order of the original dataframe, so you can use the Boolean from the lapply result to filter the original df.) – BENY May 01 '18 at 14:25
Sorry to bother you. That is what I thought. But when I type lapply(df, sort), the output is $A aaaabbbbccccdddd $B aaaabbbbccccdddd? It seems to be sorted by column... @Wen – zesla May 01 '18 at 14:31
@zesla that is just the way vectors print; you can type data.frame(lapply(df,sort)) to see the result – BENY May 01 '18 at 14:32
It's the same. The first 4 rows are all 'a', 'a'... I'm confused. Also, in R, a dataframe is a list of columns. It seems that lapply should sort by columns, not rows... I might be wrong. @Wen – zesla May 01 '18 at 14:41
@zesla let us do it the simplest way, with apply: `data.frame(t(apply(df,1,sort)))` – BENY May 01 '18 at 14:42
Yes, this looks right, but not lapply(df, sort). Sorry, Wen, I'm just trying to understand why your lapply method works... @Wen – zesla May 01 '18 at 15:04
@zesla no worries, they did the same job :-) apply is easier to understand :-) – BENY May 01 '18 at 15:06
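
For anyone following the lapply vs. apply discussion in the comments, here is a small sketch that puts the two side by side on the df from the question. The names by_column and by_row are made up for illustration. As far as I can tell, lapply() treats a data frame as a list of columns, so it sorts each column independently, while apply() with MARGIN = 1 sorts within each row, which is what the order-independent comparison needs.

```r
df <- expand.grid(A = c('a', 'b', 'c', 'd'),
                  B = c('a', 'b', 'c', 'd'))

# lapply() iterates over the columns of a data frame, so each column is
# sorted on its own and the original row pairing is lost.
by_column <- data.frame(lapply(df, sort))
head(by_column, 4)        # the first four rows are all a / a

# apply(..., 1, ...) iterates over rows, so each pair is sorted in place.
by_row <- data.frame(t(apply(df, 1, sort)))
by_row[c(2, 5), ]         # rows 2 (b, a) and 5 (a, b) both become a / b

# Number of rows each version would keep after dropping duplicates.
n_keep_column <- sum(!duplicated(by_column))
n_keep_row    <- sum(!duplicated(by_row))
```

With the column-wise version only the pairs a/a, b/b, c/c and d/d survive, so it cannot be used as a filter for the original rows; the row-wise version keeps one flag per unordered pair.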
