
I have a dataset of scores:

    ID      Sub     Score
    1       Mat     45
    2       Mat     34
    3       Mat     67
    1       Mat     43
    2       Mat     34
    4       Mat     22
    5       Sci     78
    6       Mat     32
    1       Mat     56
    1       Sci     40

I want to output only the top score for each ID in each subject. For example, the new list should show:

    ID      Sub     Score
    2       Mat     34
    3       Mat     67
    4       Mat     22
    5       Sci     78
    6       Mat     32
    1       Mat     56
    1       Sci     40

I can find the duplicated results through:

results[duplicated(results[, c(1,2)]),]

How can I order the results and delete the lower-scoring duplicates?
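For reference, the table above can be rebuilt as a data frame (a sketch; the name `df1` and the column types are assumptions):

```r
# Reconstruct the example data from the table in the question
df1 <- data.frame(
  ID    = c(1L, 2L, 3L, 1L, 2L, 4L, 5L, 6L, 1L, 1L),
  Sub   = c("Mat", "Mat", "Mat", "Mat", "Mat",
            "Mat", "Sci", "Mat", "Mat", "Sci"),
  Score = c(45L, 34L, 67L, 43L, 34L, 22L, 78L, 32L, 56L, 40L),
  stringsAsFactors = FALSE
)
```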

zx8754
pluke

2 Answers


Or use `aggregate` from base R, which computes the maximum `Score` for each `ID`/`Sub` combination:

aggregate(Score ~ ID + Sub, df, max)

#  ID Sub Score
#1  1 Mat    56
#2  2 Mat    34
#3  3 Mat    67
#4  4 Mat    22
#5  6 Mat    32
#6  1 Sci    40
#7  5 Sci    78
Ronak Shah

There are many ways to get the expected output. One option with dplyr is to group by the 'ID' and 'Sub' columns, get the top 'Score' observation with `top_n`, and, if there are duplicate rows, use `distinct`.

library(dplyr)
df1 %>% 
  group_by(ID, Sub) %>%
  top_n(1, Score) %>% 
  distinct()
#     ID   Sub Score
#  (int) (chr) (int)
#1     2   Mat    34
#2     3   Mat    67
#3     4   Mat    22
#4     5   Sci    78
#5     6   Mat    32
#6     1   Mat    56
#7     1   Sci    40
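In current dplyr releases (1.0.0 and later), `top_n` has been superseded by `slice_max`; an equivalent sketch, assuming the same `df1` as in the question:

```r
library(dplyr)

# Hypothetical reconstruction of the question's data
df1 <- data.frame(
  ID    = c(1, 2, 3, 1, 2, 4, 5, 6, 1, 1),
  Sub   = c("Mat", "Mat", "Mat", "Mat", "Mat",
            "Mat", "Sci", "Mat", "Mat", "Sci"),
  Score = c(45, 34, 67, 43, 34, 22, 78, 32, 56, 40)
)

res <- df1 %>%
  group_by(ID, Sub) %>%
  slice_max(Score, n = 1) %>%  # like top_n, this keeps ties
  distinct() %>%               # collapse fully duplicated tied rows
  ungroup()
```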

Or with data.table: we convert the 'data.frame' to a 'data.table' (`setDT(df1)`), group by 'ID' and 'Sub', order 'Score' in descending order, and subset the first row of each group combination (`.SD[1L]` or `head(.SD, 1)` can be used).

library(data.table) 
setDT(df1)[order(-Score), .SD[1L], by = .(ID, Sub)]

Another option is `unique` after ordering by the columns, so that only the first observation of each duplicate is kept:

unique(setDT(df1)[order(ID, Sub,-Score)], by = c('ID', 'Sub'))

Or with base R: we order the rows, then use `duplicated` to remove the rows that are duplicates on the first two columns.

df2 <- df1[with(df1, order(ID, Sub, -Score)),]
df2[!duplicated(df2[1:2]),]
akrun
  • Hmm. I wonder whether `.SD[1L]` or `unique` is faster, given `.SD[1L]` was recently optimized IIRC – MichaelChirico Feb 06 '16 at 16:38
  • @MichaelChirico In a recent benchmark [here](http://stackoverflow.com/questions/35223311/r-data-table-vs-mergeaggregate-performance/35223362#35223362), I find `.SD` to be very slow (but I did that using 1.9.6. It might have changed altogether in the devel version). To my knowledge, `unique` is fast. It would be good to have some benchmarks with the devel version. – akrun Feb 06 '16 at 16:40
  • 1
    Yea, this was done somewhat recently, Arun has been doing a lot of work on GForce optimizing of late; I think [this](https://github.com/Rdatatable/data.table/commit/751baff13114fb8dff972662cde11fec10d0f16e) is the commit that makes me think `.SD[1L]` could be competitive – MichaelChirico Feb 06 '16 at 16:42
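The comparison the comments ask about can be sketched as a quick timing test (the data sizes are arbitrary, and relative performance depends on the data.table version):

```r
library(data.table)

set.seed(1)
# Synthetic data large enough for timings to be visible
dt <- data.table(ID    = sample(1e3L, 1e5L, replace = TRUE),
                 Sub   = sample(c("Mat", "Sci"), 1e5L, replace = TRUE),
                 Score = sample(100L, 1e5L, replace = TRUE))

# .SD[1L] after a descending order on Score
system.time(r1 <- dt[order(-Score), .SD[1L], by = .(ID, Sub)])

# unique() on the fully ordered table, keyed by the grouping columns
system.time(r2 <- unique(dt[order(ID, Sub, -Score)], by = c("ID", "Sub")))
```

Both calls keep exactly one top-scoring row per `ID`/`Sub` group, so the two results should have the same number of rows.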