In case of duplicated value in variable, keep row with lowest value based on other variable

Question

Let's say I have

There are unique and duplicated values in df$ID. If a value in df$ID is duplicated, I want to keep the row with the lowest df$tti value.

Expected output

> df
   ID  tti
1 118  1.4
2   9  2.7
3  12  1.2

I am looking for a solution in dplyr

Data

df <- structure(list(ID = c(118L, 118L, 118L, 9L, 12L, 12L), tti = c(11.2, 
1.4, 9.2, 2.7, 1.2, 82.8)), class = "data.frame", row.names = c(NA, 
-6L))

akrun · Accepted Answer · 2021-03-16T22:50:57.280

2

We can use slice_min after grouping by 'ID'

library(dplyr)
df %>%
    group_by(ID) %>%
    slice_min(tti) %>%
    ungroup

-output

# A tibble: 3 x 2
#     ID   tti
#  <int> <dbl>
#1     9   2.7
#2    12   1.2
#3   118   1.4

Or with collapse

library(collapse)
df %>%
    fgroup_by(ID) %>%
    fsummarise(tti = fmin(tti))

#   ID tti
#1   9 2.7
#2  12 1.2
#3 118 1.4

Or another option is roworder (which is faster than arrange from dplyr) with funique

roworder(df, ID, tti) %>%
     funique(cols = 1)
#    ID tti
#1   9 2.7
#2  12 1.2
#3 118 1.4

edited Mar 16 '21 at 22:50

answered Mar 16 '21 at 20:41

akrun

874,273
37
540
662

How would one do it if `tti` was a variable in `Date` format, in order to take the most recent row ? – Julien Jul 26 '22 at 14:42
1

@Julien If it is `Date` class, the code should work as `min/max` works for the integer storage mode for Date – akrun Jul 26 '22 at 15:15

score 1 · Answer 2 · answered Mar 16 '21 at 20:41

By default distinct keeps the first row of duplicated ID. So if the data are sorted first by tti within ID then the first row will also be the smallest value of tti. By default arrange orders data in ascending order.

library(dplyr)
df %>% 
  dplyr::arrange(ID, tti) %>% 
  dplyr::distinct(ID, .keep_all = T)

Output

   ID tti
1   9 2.7
2  12 1.2
3 118 1.4

In case of duplicated value in variable, keep row with lowest value based on other variable

2 Answers2