How can I remove the duplicate rows in R

Question

In my df, I treat c('apple', 'banana') and c('banana', 'apple') as the same, because the fruit types are the same and only the order differs.

How can I remove rows 1 and 2 and keep only the last row (wanted_df)?

df = data.frame(fruit1 = c('apple', 'banana', 'fig'),
                fruit2 = c('banana', 'apple', 'cherry'))
df

wanted_df = df[3,]

Any help will be highly appreciated!

============================

Something goes wrong with my real data.

frames2 loses the rows where lag = 2. I want a data frame like wanted_frames.

library(dplyr)

pollution1 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
pollution2 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co') 
dis = 'n'
lag = 1:2

frames = expand.grid(pollution1 = pollution1, 
                     pollution2 = pollution2,
                     dis = dis, 
                     lag = lag) %>% 
  mutate(pollution1 = as.character(pollution1),
         pollution2 = as.character(pollution2), 
         dis = as.character(dis)) %>% 
  as_tibble() %>% 
  filter(pollution1 != pollution2)

vec <- with(frames, paste(pmin(pollution1, pollution2), pmax(pollution1, pollution2)))

frames2 = frames[!duplicated(vec), ]

wanted_frames = frames2 %>% mutate(lag = 2) %>% bind_rows(frames2)

Could you show an expected output? How would you like `frames2` to appear? A manual example would help. – cmirian, Feb 19 '21 at 08:35
@cmirian Hi, the last piece of code, `wanted_frames`, is my expected output. – zhiwei li, Feb 19 '21 at 08:37
`pollution1` and `pollution2` are identical. So if you apply a `filter` that omits duplicates, you are going to end up with zero rows. I am not entirely sure what you are trying to achieve. – cmirian, Feb 19 '21 at 08:38

cmirian · Answer 1 · 2021-02-19T15:27:34.063

3

Try this.

library(dplyr)
d <- filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))

Which gives

> d
  fruit1 fruit2
1    fig cherry

Update

As commented by @JonSpring and @Phil, the updated code should be

df %>% rowwise() %>% filter(!(fruit1 %in% fruit2) | !(fruit2 %in% fruit1)) %>% ungroup()
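
To see how the `%in%` filter behaves on the counterexample raised in the comments, here is a minimal base R sketch. It rests on one assumption that is worth checking against your dplyr version: under `rowwise()` each column holds a single value per row, so `fruit1 %in% fruit2` reduces to an equality test, which is emulated below with `!=`. The names `df2`, `colwise_keep`, `rowwise_keep`, and `key_keep` are illustrative and not from the thread.

```r
# Counterexample from the comments: row 2 (cherry, apple) is a unique pair.
df2 <- data.frame(
  fruit1 = c("apple", "cherry", "banana", "fig"),
  fruit2 = c("banana", "apple", "apple", "cherry"),
  stringsAsFactors = FALSE
)

# Column-wise %in%: each value is checked against the WHOLE other column.
colwise_keep <- !(df2$fruit1 %in% df2$fruit2) | !(df2$fruit2 %in% df2$fruit1)

# Per-row comparison (assumed equivalent of the rowwise() filter):
# with one value on each side, %in% is just an equality test.
rowwise_keep <- df2$fruit1 != df2$fruit2

# Reference result: order-insensitive key, keep pairs that occur exactly once.
key <- paste(pmin(df2$fruit1, df2$fruit2), pmax(df2$fruit1, df2$fruit2))
key_keep <- !(duplicated(key) | duplicated(key, fromLast = TRUE))

# Side-by-side view of which rows each approach would keep.
data.frame(df2, colwise_keep, rowwise_keep, key_keep)
```

On this data the column-wise filter keeps only row 4, the per-row comparison keeps all four rows, and only the sorted key keeps rows 2 and 4, the two pairs that occur once. If the rowwise assumption holds, a sorted key may be the safer choice.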

edited Feb 19 '21 at 15:27

answered Feb 19 '21 at 07:18

cmirian


Such a simple idea. Shouldn't it be `filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))`? – Phil Feb 19 '21 at 07:30
Sure, thank you @Phil - updated accordingly. Have a great weekend. – cmirian Feb 19 '21 at 07:55
I don't believe this works in all cases, e.g. for `df = data.frame(fruit1 = c('apple', 'cherry', 'banana', 'fig'), fruit2 = c('banana', 'apple', 'apple', 'cherry'))`. In that case row 2 is a unique combination, but it is filtered out because each of its elements appears in the other column in another row. – Jon Spring Feb 19 '21 at 08:52
@JonSpring is correct - should be fixed with `df %>% rowwise() %>% filter(...) %>% ungroup()` but it could make it slower. – Phil Feb 19 '21 at 15:24

score 2 · Answer 2 · answered Feb 19 '21 at 07:14

A base R way:

vec <- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
df[!(duplicated(vec) | duplicated(vec, fromLast = TRUE)), ]

#   fruit1 fruit2
#3    fig cherry
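
The two `duplicated()` calls do different jobs, which the following sketch makes visible by printing the intermediate values for the question's `df`. The helper names `dup_forward`, `dup_backward`, `keep_first_only`, and `drop_all_copies` are illustrative additions, and `stringsAsFactors = FALSE` is set explicitly because `pmin()`/`pmax()` are meant to work on character columns here, not factors.

```r
df <- data.frame(
  fruit1 = c("apple", "banana", "fig"),
  fruit2 = c("banana", "apple", "cherry"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() put each pair in alphabetical order, so the column order
# of the two fruits no longer matters.
vec <- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
vec
# "apple banana" "apple banana" "cherry fig"

dup_forward  <- duplicated(vec)                   # FALSE  TRUE FALSE
dup_backward <- duplicated(vec, fromLast = TRUE)  #  TRUE FALSE FALSE

# !duplicated() alone keeps the first copy of every pair: rows 1 and 3.
keep_first_only <- df[!dup_forward, ]

# Combining both directions drops every copy of a repeated pair: row 3 only.
drop_all_copies <- df[!(dup_forward | dup_backward), ]
```

Using only `!duplicated(vec)` therefore answers a different question (one representative per pair) than the full expression (pairs that never repeat).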

answered Feb 19 '21 at 07:14

Ronak Shah


@Ronak Shah Thanks for your reply, but something goes wrong when I use your method on my real data, so I have updated my question. – zhiwei li Feb 19 '21 at 08:02
@zhiweili 1) You have not used the complete code from my answer. 2) In the data frame you shared, every value is a duplicate, so everything is removed from the data. – Ronak Shah Feb 19 '21 at 09:09
Hi @Ronak Shah, I have posted a new question at https://stackoverflow.com/questions/72023327/can-not-use-pivot-longer-in-r-with-multile-cell-value-in-r. I think maybe you could help me. Thanks a lot. – zhiwei li Apr 27 '22 at 04:37
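
For the real-data follow-up in the question, one possible adjustment is sketched below. It assumes the goal is one row per unordered pollutant pair within each lag, which is what `wanted_frames` appears to describe. The idea is to make `lag` part of the key and to use `!duplicated()` on its own, so that one copy of each pair survives per lag. The names `pollutants`, `key`, and `frames_dedup` are illustrative, and base R stands in for the dplyr pipeline from the question.

```r
pollutants <- c("pm2.5", "pm10", "so2", "no2", "o3", "co")

# All ordered pairs for each lag, as character columns.
frames <- expand.grid(
  pollution1 = pollutants,
  pollution2 = pollutants,
  dis = "n",
  lag = 1:2,
  stringsAsFactors = FALSE
)

# Drop rows that pair a pollutant with itself.
frames <- frames[frames$pollution1 != frames$pollution2, ]

# Order-insensitive key that also includes lag, so the same pair at a
# different lag is NOT treated as a duplicate.
key <- with(frames, paste(pmin(pollution1, pollution2),
                          pmax(pollution1, pollution2),
                          lag))

# Keep the first row for every (pair, lag) combination.
frames_dedup <- frames[!duplicated(key), ]

nrow(frames_dedup)         # 30: 15 unordered pairs for each of the 2 lags
table(frames_dedup$lag)
```

The rows come out ordered by lag rather than in the order produced by `bind_rows()` in the question, so sort afterwards if the order matters.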

score 1 · Answer 3 · answered Feb 19 '21 at 08:58

Here's a low-tech dplyr approach. Make a sorted key, then keep rows with unique keys.

library(dplyr)
df %>%
    mutate(key = paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2))) %>%
    add_count(key) %>%
    filter(n == 1)

answered Feb 19 '21 at 08:58

Jon Spring

