
I have a large table, queried from a database, containing thousands of entries with a structure similar to Table 1 in the image below. Among the rows that share a value of Var2, I would like to keep only the one with the highest value of Var1, as shown in Table 2. The situation is similar to the one described in an earlier question on this forum, remove duplicates based on one column and keep last entry. Selecting the rows with a simple for loop (sketched below the data) works, but it takes a long time to run. Is there a faster, more elegant way of handling this in R?

Table1 <- structure(list(Var1 = 1001:1009, Var2 = c("AAA", "BBB", "CCC", 
"AAA", "DDD", "BBB", "AAA", "EEE", "DDD"), Var3 = c(95L, 100L, 
90L, 95L, 85L, 100L, 95L, 45L, 85L), Var4 = c("mg", "kg", "pg", 
"mg", "mg", "kg", "mg", "mg", "mg")), .Names = c("Var1", "Var2", 
"Var3", "Var4"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-9L), spec = structure(list(cols = structure(list(Var1 = structure(list(), class = c("collector_integer", 
"collector")), Var2 = structure(list(), class = c("collector_character", 
"collector")), Var3 = structure(list(), class = c("collector_integer", 
"collector")), Var4 = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Var1", "Var2", "Var3", "Var4")), 
    default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

[Image: Table 1 (the input data, reproduced by the dput output above) and Table 2 (the desired output: one row per Var2, keeping the highest Var1)]

3 Answers


duplicated has a fromLast option!

Table1[!duplicated(Table1$Var2, fromLast = TRUE), ]

Note that the data have to be sorted by Var1 first if they are not already in that order.
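
For example, a minimal sketch that sorts first and then keeps the last occurrence of each Var2:

# order by Var1, then keep the last (i.e. highest-Var1) row within each Var2
Table1 <- Table1[order(Table1$Var1), ]
Table1[!duplicated(Table1$Var2, fromLast = TRUE), ]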


We can use slice after grouping by 'Var2'

library(dplyr)
Table1 %>%
    group_by(Var2) %>%
    slice(which.max(Var1)) %>%
    arrange(Var1)
#     Var1  Var2  Var3  Var4    
#     <int> <chr> <int> <chr>
#  1  1003   CCC    90    pg 
#  2  1006   BBB   100    kg
#  3  1007   AAA    95    mg
#  4  1008   EEE    45    mg
#  5  1009   DDD    85    mg

Or arrange first and then filter out the duplicated Var2 values

Table1 %>% 
     arrange(Var2, -Var1) %>% 
     filter(!duplicated(Var2)) %>%
     arrange(Var1)

Or with data.table

library(data.table)
setDT(Table1)[order(Var2,-Var1)][!duplicated(Var2)][order(Var1)]

NOTE: This can be done in one step with duplicated(..., fromLast = TRUE), but here we do not know whether the values are already ordered in the original dataset, so the more compact method will not always work.
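
If the rows are known to be ordered by Var1, the one-step version would look like this (a sketch; it silently returns the wrong rows when that assumption fails):

Table1 %>%
    filter(!duplicated(Var2, fromLast = TRUE))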

We can also use the compact code

Table1[c(3, 6:9),]

as another way to get the expected output :-)


In base R, we can use ave to perform the grouped operation. It returns a logical vector indicating whether each observation equals the maximum value of Var1 within its Var2 group, which is then used to subset.

Table1[as.logical(ave(Table1$Var1, Table1$Var2, FUN=function(i) i == max(i))),]
  Var1 Var2 Var3 Var4
3 1003  CCC   90   pg
6 1006  BBB  100   kg
7 1007  AAA   95   mg
8 1008  EEE   45   mg
9 1009  DDD   85   mg