remove specific duplicate rows based on median

Question

I currently have a data frame that looks like this:

          result 1    result 2    result 3    median
item 1    8           7           6           7
item 5    1           2           3           2
item 1    6           5           4           5
item 5    3           4           5           4

I want to remove the duplicates based on the median: for each duplicated item, I want to keep the entry with the higher median. The problem is that the row names (item 1, etc.) are not a column of their own, so I can't access them with the $ operator.

How can I accomplish this? Thanks in advance.

Could also be done with `top_n`, `df %>% group_by(row) %>% top_n(1, median)` — Ronak Shah, Feb 12 '18 at 08:42

score 5 · Accepted Answer · answered Feb 12 '18 at 08:35

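You can simply sort by median in decreasing order and then remove the duplicates, i.e.

# Sort so that the highest median comes first
df <- df[order(df$median, decreasing = TRUE),]
# Keep only the first (highest-median) row for each value of 'row'
df[!duplicated(df$row),]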

which gives,

    row result1 result2 result3 median
1 item1       8       7       6      7
4 item5       3       4       5      4

Sotos

Sorry about the confusion but the "row" column is actually just the rownames (it's not its own column) - how do I tackle this? Thanks for your help. Seems like a real easy fix. – Alex Johanssen Feb 12 '18 at 08:41
nevermind, just added another column and took care of it. thanks for your help! – Alex Johanssen Feb 12 '18 at 08:55
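
For the row-names case raised in the comments, a minimal base R sketch follows. It assumes the object is actually a matrix, since R data frames cannot carry duplicate row names; the matrix `m` and the names `m_sorted` and `m_best` are made up for illustration and rebuilt from the values in the question.

```r
# Hypothetical matrix with repeated row names, mirroring the question's data.
m <- matrix(
  c(8, 7, 6, 7,
    1, 2, 3, 2,
    6, 5, 4, 5,
    3, 4, 5, 4),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("item 1", "item 5", "item 1", "item 5"),
    c("result 1", "result 2", "result 3", "median")
  )
)

# Sort by median, highest first, so the row to keep comes first for each name.
m_sorted <- m[order(m[, "median"], decreasing = TRUE), , drop = FALSE]

# Keep only the first occurrence of each row name.
m_best <- m_sorted[!duplicated(rownames(m_sorted)), , drop = FALSE]
m_best
```

This is the same order-then-deduplicate idea as in the answer, applied to `rownames()` instead of a `row` column, so no extra column is needed.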

score 1 · Answer 2 · answered Feb 12 '18 at 08:33

We can group by 'row' and then filter the rows having the max value for 'median'

library(dplyr)
df1 %>%
   group_by(row) %>% 
   filter(median == max(median))
# A tibble: 2 x 5
# Groups: row [2]
#   row    result1 result2 result3 median
#   <chr>    <int>   <int>   <int>  <int>
#1 item 1       8       7       6      7
#2 item 5       3       4       5      4

If several rows tie for the maximum 'median', the filter above keeps all of them. To keep only the first matching row per group, use which.max with slice:

df1 %>%
    group_by(row) %>%
    slice(which.max(median))
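
To see the difference between the two approaches in isolation, here is a small base R sketch. The vector `med` is invented for illustration and is not taken from the question's data.

```r
# Hypothetical medians within one group, with a tie for the maximum.
med <- c(7, 7, 5)

# Comparing against max() flags every tied position.
all_tied <- which(med == max(med))

# which.max() returns only the first position holding the maximum.
first_only <- which.max(med)

all_tied    # positions 1 and 2
first_only  # position 1
```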

score 0 · Answer 3 · answered Feb 12 '18 at 08:47

Here is a solution with data.table

library("data.table")
D <- fread(
"item   result1    result2    result3    median
item1    8             7           6         7
item5    1             2           3         2
item1    6             5           4         5
item5    3             4           5         4")
# Add each item's maximum median as a column, then keep the rows that match it
D[, maxmed:=max(median), by=item][median==maxmed]
