-2

I'm working with a large dataframe with two variables of interest: id and date. I would like to eliminate every row with duplicated id, but I want to keep the row with the latest date. I have been using dplyr::distinct but I can't figure out how to add this date condition.

Edward
  • 10,360
  • 2
  • 11
  • 26
Alvaro Acedo
  • 23
  • 1
  • 5

2 Answers2

1
set.seed(42)
dat <- tibble(id=sample(LETTERS[1:3], size=100, replace=TRUE), date=sample(10, size=100, replace=TRUE))
dat
# # A tibble: 100 x 2
#    id     date
#    <chr> <int>
#  1 A         8
#  2 A         7
#  3 A         6
#  4 A         1
#  5 B         5
#  6 B         9
#  7 B         7
#  8 A        10
#  9 C        10
# 10 C        10
# # ... with 90 more rows

dat %>%
  group_by(id) %>%
  slice(which.max(date))
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id     date
#   <chr> <int>
# 1 A        10
# 2 B        10
# 3 C        10
r2evans
  • 141,215
  • 6
  • 77
  • 149
0

An approach with janitor::get_dupes

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

set.seed(42)
dat <- tibble(id=sample(LETTERS[1:3], size=100, replace=TRUE), date=sample(10, size=100, replace=TRUE))
dat
#> # A tibble: 100 x 2
#>    id     date
#>    <chr> <int>
#>  1 A         8
#>  2 A         7
#>  3 A         6
#>  4 A         1
#>  5 B         5
#>  6 B         9
#>  7 B         7
#>  8 A        10
#>  9 C        10
#> 10 C        10
#> # … with 90 more rows

dat %>% 
  get_dupes(id) %>% 
  group_by(id) %>% 
  arrange(desc(date)) %>% 
  slice(1)
#> # A tibble: 3 x 3
#> # Groups:   id [3]
#>   id    dupe_count  date
#>   <chr>      <int> <int>
#> 1 A             40    10
#> 2 B             39    10
#> 3 C             21    10

Created on 2020-07-08 by the reprex package (v0.3.0)

phiggins
  • 248
  • 1
  • 7