I have a data frame in which some IDs appear in more than one row, and for each ID I want to drop the rows with the lowest value, ideally using dplyr. I've tried the following, but it removes only some of the duplicate rows while others remain.

Below is an example of what the data frame looks like. The rows to remove should be chosen based on Col2: for each ID, only the row with the highest Col2 value should be kept.

Current DataFrame

ID   Col1 Col2
ABA  0.65 0.66
ABB  0.65 0.66
ABB  0.65 0.77
ABC  0.55 0.88
ABC  0.14 0.14
ABC  0.15 0.50
ABD  0.25 0.60

Desired DataFrame

ID   Col1 Col2
ABA  0.65 0.66
ABB  0.65 0.77
ABC  0.55 0.88
ABD  0.25 0.60

Code Attempt

df %>% group_by(id) %>% top_n(0, Col2)

and

df <- df[order(df$id, df$Col2), ]
df <- df[ !duplicated(df$Col2), ]
vvvvv

1 Answer

A possible solution:

library(dplyr)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
  Col1 = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
  Col2 = c(0.66, 0.66, 0.77, 0.88, 0.14, 0.5, 0.6)
)

# Within each ID, keep the row with the largest Col2
df %>% 
  group_by(ID) %>% 
  slice_max(Col2, n = 1) %>% 
  ungroup()

#> # A tibble: 4 × 3
#>   ID     Col1  Col2
#>   <chr> <dbl> <dbl>
#> 1 ABA    0.65  0.66
#> 2 ABB    0.65  0.77
#> 3 ABC    0.55  0.88
#> 4 ABD    0.25  0.6
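
For comparison, the base R attempt in the question can be adapted to give the same result. This is only a sketch, assuming the column is named `ID` as in the example data. Two changes are needed: sort `Col2` in descending order within each `ID`, and apply `duplicated()` to `ID` rather than to `Col2`. The names `sorted` and `result` are illustrative.

```r
# Same example data as above (columns ID, Col1, Col2).
df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
  Col1 = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
  Col2 = c(0.66, 0.66, 0.77, 0.88, 0.14, 0.5, 0.6)
)

# Sort by ID, then by Col2 descending, so the largest Col2 is first within each ID.
sorted <- df[order(df$ID, -df$Col2), ]

# Keep only the first row for each ID; duplicated() must look at ID, not Col2.
result <- sorted[!duplicated(sorted$ID), ]
rownames(result) <- NULL

print(result)
```

Because `duplicated()` flags every occurrence after the first, this keeps exactly one row per `ID`, even when two rows share the same maximum `Col2`.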
PaulS
  • Thanks, I didn't realize it was a duplicate question, but this answer worked very smoothly and I was able to save the result to another data frame. Just to clarify, the n=1 value here is just to remove 1 row, correct? Thanks :) – machine_apprentice Jan 13 '22 at 08:33
  • Glad to have helped you, @machine_apprentice! `slice_max` returns the rows with the maximum values of `Col2`: if `n=1`, it returns the top row (the one with the maximum value of `Col2`); if `n=2`, it returns the top 2 rows, and so on. Was I clear enough? – PaulS Jan 13 '22 at 11:00
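
One detail about `n` that may be worth a small sketch: by default `slice_max()` uses `with_ties = TRUE`, so a group whose maximum is shared by several rows can return more than `n` rows. The data below are made up for illustration and are not from the question.

```r
library(dplyr)

# Hypothetical data: ID "A" has two rows tied for the largest Col2.
df_ties <- data.frame(
  ID = c("A", "A", "A", "B"),
  Col2 = c(0.9, 0.9, 0.1, 0.5)
)

# Default (with_ties = TRUE): both tied rows for "A" are returned.
keep_ties <- df_ties %>%
  group_by(ID) %>%
  slice_max(Col2, n = 1) %>%
  ungroup()

# with_ties = FALSE: at most n rows per group are returned.
one_per_id <- df_ties %>%
  group_by(ID) %>%
  slice_max(Col2, n = 1, with_ties = FALSE) %>%
  ungroup()

print(nrow(keep_ties))   # 3 rows: two for "A", one for "B"
print(nrow(one_per_id))  # 2 rows: one per ID
```

If exactly one row per ID is required, passing `with_ties = FALSE` is the safer choice. With the data in the question there are no ties within an ID, so both forms give the same four rows.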