
I have data with columns "ID" and "value", where an ID might be repeated. I would like to find all rows with duplicate IDs and keep only the one with the higher value.

mydf <- data.frame(ID = c(1,2,2,3,4), value = c(5, 8, 20, 18,15))

I am working with dplyr. So far I can find the duplicates:

library(dplyr)

find_dup <- function(dataset, var) {
  # keep only values of var that appear more than once, sorted by var
  dataset %>% group_by({{ var }}) %>% filter(n() > 1) %>% ungroup() %>% arrange({{ var }})
}
find_dup(mydf, ID)
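
For reference, on the sample data above this returns just the two rows with ID 2 (values 8 and 20):

#> # A tibble: 2 x 2
#>      ID value
#>   <dbl> <dbl>
#> 1     2     8
#> 2     2    20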

But I am having trouble with the next step; I'm not sure how to "point to" the larger value within each group of duplicates. I'm hoping to stay with a tidyverse solution for now if possible. Any thoughts welcome, thanks!

marcel

1 Answer


Rather than specifically identifying and removing duplicates, you could group_by ID and use slice_max to keep the top value in each group.

library(dplyr)

mydf <- data.frame(ID = c(1, 2, 2, 3, 4), value = c(5, 8, 20, 18, 15))

mydf %>% 
  group_by(ID) %>% 
  slice_max(value, n = 1) %>%
  ungroup()
#> # A tibble: 4 x 2
#>      ID value
#>   <dbl> <dbl>
#> 1     1     5
#> 2     2    20
#> 3     3    18
#> 4     4    15

Created on 2023-08-07 with reprex v2.0.2
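
One small variation worth noting: slice_max() keeps all tied rows by default, so if two rows share the same maximum value for an ID you will get both back. If you only ever want a single row per ID, pass with_ties = FALSE:

mydf %>% 
  group_by(ID) %>% 
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()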

Allan Cameron