Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

Question

I have a dataframe as follows:

x1    x2    x3    x4
X     A     B     4
Y     C     B     2
Z     A     C     1
X     C     B     5

I want to remove such rows that are duplicates based on the values of columns x2 and x3. The one with a bigger value on column x4 will be kept. So in this case the result would be the following:

x1    x2    x3    x4
X     A     B     4
Z     A     C     1
X     C     B     5

The second row was removed because the pairs (C, B) appeared twice and the fourth row was kept because 5 > 2 on column x4.

I have the following code that seems to remove the duplicates correctly but does not keep the column x1 in the end result:

result <- df%>%group_by(x2, x3)%>%summarise(x4=max(x4))

x2    x3    x4
A     B     4
A     C     1
C     B     5

How can all columns x1 ... x4 be kept in the end result?

Also a two liner not provided in the referenced duplicate link: Here we do not to ungroup: `library(dplyr) df %>% arrange(desc(x4)) %>% distinct(x2,x3, .keep_all = TRUE)` — TarJae, Nov 01 '21 at 14:29

score 5 · Accepted Answer · answered Nov 01 '21 at 13:41

5

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>% 
 group_by(x2, x3) %>% 
 slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5

answered Nov 01 '21 at 13:41

Sotos

51,121
6
32
66

1

There's also now `dplyr::slice_max` – camille Nov 01 '21 at 16:06

Matt · Answer 2 · 2021-11-01T13:49:35.180

1

library(dplyr)

df %>% 
  group_by(x2,x3) %>% 
  filter(x4 == max(x4)) 

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5

Sotos answer is more concise, but here's another way to do it.

edited Nov 01 '21 at 13:49

answered Nov 01 '21 at 13:41

Matt

7,255
2
12
34

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

2 Answers2