2

I have a dataframe as follows:

x1    x2    x3    x4
X     A     B     4
Y     C     B     2
Z     A     C     1
X     C     B     5

I want to remove such rows that are duplicates based on the values of columns x2 and x3. The one with a bigger value on column x4 will be kept. So in this case the result would be the following:

x1    x2    x3    x4
X     A     B     4
Z     A     C     1
X     C     B     5

The second row was removed because the pairs (C, B) appeared twice and the fourth row was kept because 5 > 2 on column x4.

I have the following code that seems to remove the duplicates correctly but does not keep the column x1 in the end result:

result <- df%>%group_by(x2, x3)%>%summarise(x4=max(x4))

x2    x3    x4
A     B     4
A     C     1
C     B     5

How can all columns x1 ... x4 be kept in the end result?

Aweq
  • 23
  • 4
  • Also a two liner not provided in the referenced duplicate link: Here we do not to ungroup: `library(dplyr) df %>% arrange(desc(x4)) %>% distinct(x2,x3, .keep_all = TRUE)` – TarJae Nov 01 '21 at 14:29

2 Answers2

5

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>% 
 group_by(x2, x3) %>% 
 slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
Sotos
  • 51,121
  • 6
  • 32
  • 66
1
library(dplyr)

df %>% 
  group_by(x2,x3) %>% 
  filter(x4 == max(x4)) 

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5

Sotos answer is more concise, but here's another way to do it.

Matt
  • 7,255
  • 2
  • 12
  • 34