Removing duplicates based on greatest value of another columns

Question

Have purchases of multiple parking passes per student per semester. Trying to delete them based which permit they had the longest. Combined student ID and semester in order to make the unique ID. There are 80 columns in the dataframe but for this purpose I am only dealing with 2.

UniqueID       PermitDuration
01-Fall16              10
01-Fall17              11
01-Fall17              5
01-Fall17              11
02-Fall16              5
02-Fall16              12
02-Fall17              2
02-Fall17              6

I want to remove the duplicates of the UniqueID based on which one has the largest value in PermitDuration. The desired dataframe would look like this.

UniqueID       PermitDuration
01-Fall16              10
01-Fall17              11
02-Fall16              12
02-Fall17              6

This from [here](https://stats.stackexchange.com/questions/11193/how-do-i-remove-all-but-one-specific-duplicate-record-in-an-r-data-frame): `unique(subset(df, PermitDuration==ave(PermitDuration, UniqueID, FUN=max)))` — M--, Dec 13 '17 at 19:18
Still trying to determine the common element, but that solution only removes a few of the duplicates while leaving most of them behind. — Xanman, Dec 13 '17 at 20:16

A5C1D2H2I1M1N2O1R2T1 · Answer 1 · 2017-12-13T19:22:51.707

2

You can make use of max. Here's an example using data.table:

> library(data.table)
> setDT(mydf)[, PermitDuration[max(PermitDuration)], UniqueID]
    UniqueID V1
1: 01-Fall16 10
2: 01-Fall17 11
3: 02-Fall16 12
4: 02-Fall17  6

Or even with aggregate:

aggregate(PermitDuration ~ UniqueID, mydf, max)

edited Dec 13 '17 at 19:22

answered Dec 13 '17 at 19:16

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

1

`x[max(x)]` or `x[which.max(x)]` ? I haven't tried to read the OP's data in, but the former seems weird. – Frank Dec 13 '17 at 19:56
Is there an argument I can pass to the aggregate method that will allow me to keep the other 78 columns (variables)? – Xanman Dec 13 '17 at 20:14
1

Was able to answer that with some other posts and your help. Ended up using as.data.table(df)[ , .SD[which.max(PermitDuration)], by= UniqueID] – Xanman Dec 13 '17 at 20:29

score 1 · Answer 2 · answered Dec 13 '17 at 19:18

1

With dplyr:

df %>% distinct %>% group_by(UniqueID) %>% top_n(1, PermitDuration)

answered Dec 13 '17 at 19:18

David Klotz

2,401
1
7
16

Removing duplicates based on greatest value of another columns

2 Answers2