101

I have a dataframe with >100 columns, and I would to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

In the below, I would like to unique only using id and id2:

data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))

id id2 somevalue
1   1         x
1   1         y
3   4         z

I would like to obtain either:

id id2 somevalue
1   1         x
3   4         z

or:

id id2 somevalue
1   1         y
3   4         z

(I have no preference which of the unique rows is kept)

AndrewGB
  • 16,126
  • 5
  • 18
  • 49
Ina
  • 4,400
  • 6
  • 30
  • 44
  • Your desired output isn't well defined. How do you choose to include x but not y? This decision will need to be made for every column of every repeated row, and you've given no indication of how to do it. – joran Mar 30 '12 at 14:33
  • 1
    I have no preference whether 'x' or 'y' is included. I'll update the question. – Ina Mar 30 '12 at 14:34
  • For `data.table` alternatives: [Filtering out duplicated/non-unique rows in data.table](https://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table) – Henrik Jul 10 '18 at 15:21

4 Answers4

146

Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:

dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
  id id2 somevalue
1  1   1         x
3  3   4         z

Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)

joran
  • 169,992
  • 32
  • 429
  • 468
  • @Ferroao according to this website, it's not possible in SQL http://www.dofactory.com/sql/select-distinct – 3pitt Oct 18 '17 at 19:22
  • what if the value in the non-duplicated column matters, and we want to keep all the possible combinations? https://stackoverflow.com/questions/75300005/extract-all-possible-combinations-of-rows-with-unique-values-in-a-variable – SteveMcManaman Jan 31 '23 at 16:36
41

Here are a couple dplyr options that keep non-duplicate rows based on columns id and id2:

library(dplyr)                                        
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)
sbha
  • 9,802
  • 2
  • 74
  • 62
  • 4
    I guess no reason to consider using the alternatives to `distinct` – Frank Jul 17 '18 at 18:40
  • 1
    @Frank maybe if there is a date or some other sequential field a combination of the two other options with some slight tweaks could be used to ensure the most recent observation is kept `df %>% group_by(id, id2) %>% filter(date == max(date)) %>% slice(1)` without having to sort the data. `filter()` gets the most recent date, `slice()` ensures only one observation is returned if there are ties. In other cases something like `df %>% group_by(id, id2) %>% slice(max(row_number()))` could might give some more flexibility. You could always use `arrange()` before `distinct()` too. – sbha Feb 09 '19 at 12:32
  • @sbha Is there a method to designate a preference for a row with a certain column value when there is a tie in the column you are grouping on? In the case of the example in the question, the row with `somevalue == x` is always returned when the row is a duplicate in the `id` and `id2` columns. – Lorcán May 20 '19 at 10:57
  • what if the value in the non-duplicated column matters, and we want to keep all the possible combinations? https://stackoverflow.com/questions/75300005/extract-all-possible-combinations-of-rows-with-unique-values-in-a-variable – SteveMcManaman Jan 31 '23 at 16:36
15

Using unique():

dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))    
dat[row.names(unique(dat[,c("id", "id2")])),]
Gary Feng
  • 420
  • 5
  • 8
1

Minor update in @Joran's code.
Using the code below, you can avoid the ambiguity and only get the unique of two columns:

dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))    
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]
taras
  • 6,566
  • 10
  • 39
  • 50