1

I have a dataframe with +/- 300000 observations of 11 variables. A snapshot is given of the ones that I would like to use.

Location.type  Package.ID     Version    WeekID     Name
Office         301502         3.0        201542     William
Office         301502         2.7        201542     Claire
Production     9764933        1.6        201214     John
Home           298793         2.6        201746     Bill
Home           298793         2.5        201738     William
Production     2803789        4.2        201605     Brad
Production     2803789        4.19       201605     Richard
Production     2803789        4.18       201605     Vanessa

I want to omit the rows that have both a duplicated Package.ID and WeekID and keep the row with the highest value in Version, but keep all other information. My desired output is:

Location.type  Package.ID     Version    WeekID     Name
Office         301502         3.0        201542     William
Production     9764933        1.6        201214     John
Home           298793         2.6        201746     Bill
Home           298793         2.5        201738     William
Production     2803789        4.2        201605     Brad

My question is similar to Remove duplicates with largest absolute value. However, in that case picking the highest value depends on one column, in my case on two. Maybe this is a simple adjustment, but I could not figure it out myself.

Joey
  • 67
  • 1
  • 6
  • Multiple ways to do this, with `dplyr`, `df %>% group_by(WeekID, Package.ID) %>% summarise(Value = max(Version))` – Ronak Shah May 25 '18 at 08:25
  • Have a look at [how to delete duplicates in rows based on more than one column](https://stackoverflow.com/questions/25301476/delete-duplicate-rows-in-two-columns-simultaneously) and then apply the solution in [Remove duplicates keeping entry with largest absolute value](https://stackoverflow.com/questions/12805964/remove-duplicates-keeping-entry-with-largest-absolute-value) – nghauran May 25 '18 at 08:33
  • Try this: `df2 <- df[order(df$Package.ID, df$WeekID, -abs(df$Version)), ]` and then `df2[!duplicated(df2[c("Package.ID","WeekID")]), ]` – nghauran May 25 '18 at 08:37
  • 1
    @ANG Thanks, the combination worked! – Joey May 25 '18 at 08:40
  • Glad it helps @Joey and welcome to SO! As for me, I would recommend this solution using `dplyr`: `df %>% group_by(Package.ID, WeekID) %>% filter(Version == max(Version))` This will filter `df` by keeping the highest values per group when groups are defined as a combination of `Package.ID` and `WeekID`. It also gives exactly your desired output. – nghauran May 25 '18 at 08:54
  • @ANG I tried it too and it indeed works too, but there is a difference of 1 observation between the two solutions, with the second option having an observation less. I'll sort the out where the difference comes from. – Joey May 25 '18 at 09:26

0 Answers0