Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column?

Question

I would like to take this dataset and remove the values if they have the same id and age(duplicates) but keep the one with the highest month number.

ID|Age|Month|
1   25   7
1   25  12
2   18  10
2   18  11
3   12  10
3   25  10
4   19  10
5   10  2
5   10  3

And have the outcome be

ID|Age|Month
 1   25   12
 2   18   11
 3   12   10
 3   25   10
 4   19   10
 5   10    3

Note that it removed the duplicates but kept the version with the highest month number.

score 1 · Accepted Answer · answered Feb 11 '21 at 15:38

as a solution option

library(tidyverse)
df <- read.table(text = "ID Age Month 
1   25   7
1   25  12
2   18  10
2   18  11
3   12  10
3   25  10
4   19  10
5   10  2
5   10  3", header = T)

df %>% 
  group_by(ID, Age) %>% 
  slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups:   ID, Age [6]
#>      ID   Age Month
#>   <int> <int> <int>
#> 1     1    25    12
#> 2     2    18    11
#> 3     3    12    10
#> 4     3    25    10
#> 5     4    19    10
#> 6     5    10     3

^{Created on 2021-02-11 by the reprex package (v1.0.0)}

score 0 · Answer 2 · answered Feb 11 '21 at 15:39

Using dplyr package, the solution:

df %>% 
+   group_by(ID, Age) %>%
+   filter(Month == max(Month))
# A tibble: 6 x 3
# Groups:   ID, Age [6]
     ID   Age Month
  <dbl> <dbl> <dbl>
1     1    25    12
2     2    18    11
3     3    12    10
4     3    25    10
5     4    19    10
6     5    10     3

Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column?

2 Answers2