I'm working with a large dataframe with two variables of interest: id and date. I would like to eliminate every row with duplicated id, but I want to keep the row with the latest date. I have been using dplyr::distinct
but I can't figure out how to add this date condition.
Asked
Active
Viewed 119 times
-2

Edward
- 10,360
- 2
- 11
- 26

Alvaro Acedo
- 23
- 1
- 5
2 Answers
1
set.seed(42)
dat <- tibble(id=sample(LETTERS[1:3], size=100, replace=TRUE), date=sample(10, size=100, replace=TRUE))
dat
# # A tibble: 100 x 2
# id date
# <chr> <int>
# 1 A 8
# 2 A 7
# 3 A 6
# 4 A 1
# 5 B 5
# 6 B 9
# 7 B 7
# 8 A 10
# 9 C 10
# 10 C 10
# # ... with 90 more rows
dat %>%
group_by(id) %>%
slice(which.max(date))
# # A tibble: 3 x 2
# # Groups: id [3]
# id date
# <chr> <int>
# 1 A 10
# 2 B 10
# 3 C 10

r2evans
- 141,215
- 6
- 77
- 149
0
An approach with janitor::get_dupes
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
set.seed(42)
dat <- tibble(id=sample(LETTERS[1:3], size=100, replace=TRUE), date=sample(10, size=100, replace=TRUE))
dat
#> # A tibble: 100 x 2
#> id date
#> <chr> <int>
#> 1 A 8
#> 2 A 7
#> 3 A 6
#> 4 A 1
#> 5 B 5
#> 6 B 9
#> 7 B 7
#> 8 A 10
#> 9 C 10
#> 10 C 10
#> # … with 90 more rows
dat %>%
get_dupes(id) %>%
group_by(id) %>%
arrange(desc(date)) %>%
slice(1)
#> # A tibble: 3 x 3
#> # Groups: id [3]
#> id dupe_count date
#> <chr> <int> <int>
#> 1 A 40 10
#> 2 B 39 10
#> 3 C 21 10
Created on 2020-07-08 by the reprex package (v0.3.0)

phiggins
- 248
- 1
- 7