
I have a three-column data.frame (variables: ID.A, ID.B, DISTANCE). I would like to remove duplicate rows under a condition: keeping the row with the smallest value in column 3.

It is the same problem as here: R, conditionally remove duplicate rows (similar question: Remove duplicates based on 2nd column condition).

But, in my situation, there is a second problem: I have to remove rows when the triplets (ID.A, ID.B, DISTANCE) are duplicated, and not only when ID.A is duplicated.

I tried several things, such as:

library(plyr)
df <- ddply(df, 1:3, function(df) return(df[df$DISTANCE == min(df$DISTANCE), ]))

but it didn't work: because it groups by all three columns, each group has a single DISTANCE value, so the min() condition is always true and no rows are removed.

Example:

This dataset

    id.a id.b dist
1    1    1   12
2    1    1   10
3    1    1   8
4    2    1   20
5    1    1   15
6    3    1   16

Should become:

    id.a id.b dist
3    1    1   8
4    2    1   20
6    3    1   16
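
For anyone who wants to reproduce it, the example data can be built like this (a minimal sketch, using the lowercase column names from the example):

df <- data.frame(
  id.a = c(1, 1, 1, 2, 1, 3),
  id.b = c(1, 1, 1, 1, 1, 1),
  dist = c(12, 10, 8, 20, 15, 16)
)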
Spes Alpha
  • this is a perfect job for `dplyr`, but your question doesn't make sense. What does "keep the smallest value in column 3" and "remove rows where triplets are duplicated" mean? If the triplet is duplicated, then necessarily there is only one value in column 3? Am I missing something? – Alex Jul 07 '15 at 00:32
  • see this question: http://stackoverflow.com/questions/22959635/remove-duplicated-rows-using-dplyr – Alex Jul 07 '15 at 00:34
  • @Alex - I'm assuming OP means the couplet `ID.A / ID.B`, but they should clarify. – thelatemail Jul 07 '15 at 00:35
  • I added an example. @thelatemail you're right, it wasn't clear; it's the couple ID.A / ID.B, with the condition on column 3 – Spes Alpha Jul 07 '15 at 00:42
  • I'm pretty sure this is a duplicate of one of the linked questions - you can just run an `order` operation first to get what you want: `dat <- dat[do.call(order, dat),]; dat[!duplicated(dat[1:2]),]` – thelatemail Jul 07 '15 at 00:46
  • It's more clear. Thank you! – Spes Alpha Jul 07 '15 at 00:58
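
Expanding thelatemail's comment above into a full snippet, the base R approach looks like this on the example data (a sketch, assuming the data.frame is named `df` as in the question):

df <- df[do.call(order, df), ]  # sort by id.a, then id.b, then dist (ascending)
df[!duplicated(df[1:2]), ]      # keep the first (smallest-dist) row of each id.a / id.b pair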

2 Answers


Using dplyr, and a suitable modification of Remove duplicated rows using dplyr:

library(dplyr)

df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>%            # arrange in ascending order by distance
  filter(row_number() == 1)    # keep the first (smallest-distance) row of each group
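
Run on the example data, this returns one row per (id.a, id.b) pair (the rows with dist 8, 20 and 16), matching the desired output in the question; the exact row order may differ depending on the dplyr version.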
Alex
  • Why not use `distinct()`: `df %>% group_by(id.a, id.b) %>% arrange(dist) %>% distinct` – Steven Beaupré Jul 07 '15 at 01:49
  • I don't know how to use `distinct`. – Alex Jul 07 '15 at 01:54
  • Well, I guess now you do :) – Steven Beaupré Jul 07 '15 at 01:55
  • I find `distinct` rather confusing, as the helpfile doesn't really explain how it works with grouped data frames. – Alex Jul 07 '15 at 02:03
  • @Alex If the data frame is grouped, then `distinct` simply finds the distinct entries within each group. – Claus Wilke Jul 07 '15 at 03:10
  • Claus is correct. Maybe if you look at it this way: `df %>% arrange(dist) %>% distinct(id.a, id.b)` – Steven Beaupré Jul 07 '15 at 03:20
  • @ClausWilke I don't think that is correct, unless you mean "find distinct entries within groups". I do not see why the following two should be equivalent: `df %>% distinct(id.a, id.b)` and `df %>% group_by(id.a, id.b) %>% distinct()` – Alex Jul 07 '15 at 03:55
  • @StevenBeaupré @alex `distinct()` with groups is indeed strange, and doesn't always behave as one would expect. Note that the documentation available through `?distinct` doesn't talk about grouping at all. Maybe `distinct` is not meant to be used on grouped variables after all. – Claus Wilke Jul 07 '15 at 04:30

Another way of achieving the solution while retaining all the columns:

df %>% arrange(dist) %>% 
  distinct(id.a, id.b, .keep_all=TRUE)

#   id.a id.b dist
# 1    1    1    8
# 2    3    1   16
# 3    2    1   20
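
Note that `.keep_all = TRUE` requires a relatively recent dplyr (0.5.0 or later, if I recall correctly; it did not exist when the question was asked). On older versions, the `group_by()` + `filter()` approach from the first answer gives the same result.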
Prradep