
I am trying to remove duplicate rows from a data frame based on the max value in a different column.

So, for the data frame:

df <- data.frame(rbind(c("a",2,3), c("a",3,4), c("a",3,5),
                       c("b",1,3), c("b",2,6), c("r",4,5)))
colnames(df) <- c("id","val1","val2")

id val1 val2
 a    2    3
 a    3    4
 a    3    5
 b    1    3
 b    2    6
 r    4    5

I would like to remove all duplicates by id, dropping every row that does not have the maximum value for val2 within its id group.

Thus the data frame should become:

 a    3    5
 b    2    6
 r    4    5

-> remove all `a` duplicates but keep the row with the max value for `df$val2` within `subset(df, df$id=="a")`

agatha
4 Answers


Using base R. Here, the columns are factors, so make sure to convert `val2` to numeric first:

 df$val2 <- as.numeric(as.character(df$val2))
 df[with(df, ave(val2, id, FUN=max)==val2),]
 #  id val1 val2
 #3  a    3    5
 #5  b    2    6
 #6  r    4    5

Or using `dplyr`:

 library(dplyr)
 df %>% 
    group_by(id) %>% 
    filter(val2==max(val2))
 #   id val1 val2
 #1  a    3    5
 #2  b    2    6
 #3  r    4    5
akrun
  • You'll probably need to add `mutate(val2 = as.numeric(as.character(val2))) %>%` to your `dplyr` solution – David Arenburg Sep 21 '14 at 19:29
  • @David Arenburg I used the converted `df` from the previous solution. Yes, you are right about it, but I am guessing that the OP might have made a mistake in constructing the example. – akrun Sep 21 '14 at 19:31
  • (+1) for posting both of these one split second before I was going to :) – David Arenburg Sep 21 '14 at 19:32
  • This works great, but if I have millions of rows with very few duplicates, it takes quite a while - that is the code will also have to create millions of groups that consist of only 1 element. Is there any way to improve the solution in such a case? – user3032689 Sep 19 '16 at 15:52
  • @user3032689 In that case, `df %>% group_by(id) %>% arrange(desc(val2)) %>% slice(1L)` should be faster. – akrun Sep 19 '16 at 15:55
  • Thanks @akrun. Do you also know the solution in `data.table`? I am not using `dplyr` atm and it's probably a one-liner as well? – user3032689 Sep 19 '16 at 15:59
  • 1
    @user3032689 With data.table `setDT(df)[order(-val2), head(.SD, 1), id]` – akrun Sep 19 '16 at 16:01
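For reference, here is a runnable sketch of the two faster variants suggested in the comments above (assuming `df` has already been built and `val2` converted to numeric, as in the answer):

 library(dplyr)
 # sort the whole frame by val2 descending, then keep the first row
 # per id; avoids recomputing max() separately for every group
 df %>%
    group_by(id) %>%
    arrange(desc(val2)) %>%
    slice(1L)

 library(data.table)
 # data.table equivalent: order once, then take the first row of
 # each id's subset (.SD)
 setDT(df)[order(-val2), head(.SD, 1), id]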

One possible way is to use `data.table`:

library(data.table)
setDT(df)[, .SD[which.max(val2)], by = id]
##    id val1 val2
## 1:  a    3    5
## 2:  b    2    6
## 3:  r    4    5
David Arenburg
  • This is great! Thank you! However, I am using the R integration with Spotfire, and there have been issues with installing packages, compatibility between OSes, software versions, statistical services, etc., so it would be nice to have a solution that does not require additional libraries or loops. – agatha Sep 21 '14 at 19:24
  • Very nice! But there is a faster way using `setDT(df)[df[, .I[which.max(val2)], by = id]$V1]`. Hope it helps for larger datasets. – Jordi Aceiton Oct 23 '20 at 21:01
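The `.I` variant from that last comment, spelled out as a sketch (assuming `df` as constructed in the question, with `val2` numeric):

 library(data.table)
 setDT(df)  # convert to data.table in place
 # .I holds the original row numbers; the inner call returns, for each
 # id, the row number of that group's maximum val2 (as column V1), and
 # the outer df[...] then subsets by those row numbers
 df[df[, .I[which.max(val2)], by = id]$V1]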

Here's how I hope your data is really set up:

df <- data.frame (id = c(rep("a", 3), rep("b", 2), "r"),
                  val1 = c(2, 3, 3, 1, 2, 4), val2 = c(3, 4, 5, 3, 6, 5))

You could do a split-unsplit:

 unsplit(lapply(split(df, df$id), function(x) {
     if (nrow(x) > 1) {
         x[duplicated(x$id) & x$val2 == max(x$val2), ]
     } else {
         x
     }
 }), levels(df$id))
#   id val1 val2
# 3  a    3    5
# 5  b    2    6
# 6  r    4    5

You can also use `Reduce(rbind, ...)` or `do.call(rbind, ...)` in place of `unsplit`; for example:
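A sketch of the `do.call(rbind, ...)` version, reusing the same per-group function (note the row names will differ slightly from the `unsplit` output):

 # rbind the per-id subsets back together instead of unsplitting
 do.call(rbind, lapply(split(df, df$id), function(x) {
     if (nrow(x) > 1) {
         x[duplicated(x$id) & x$val2 == max(x$val2), ]
     } else {
         x
     }
 }))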

Rich Scriven

Another `dplyr` option:

 df %>%
    group_by(id) %>%
    slice(which.max(val2))

  id   val1  val2
   a    3     5
   b    2     6
   r    4     5
General Grievance
JLCEBRIAN