Unique on a dataframe with only selected columns

Question

I have a dataframe with >100 columns, and I would to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

In the below, I would like to unique only using id and id2:

data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))

id id2 somevalue
1   1         x
1   1         y
3   4         z

I would like to obtain either:

id id2 somevalue
1   1         x
3   4         z

or:

id id2 somevalue
1   1         y
3   4         z

(I have no preference which of the unique rows is kept)

Your desired output isn't well defined. How do you choose to include x but not y? This decision will need to be made for every column of every repeated row, and you've given no indication of how to do it. — joran, Mar 30 '12 at 14:33
I have no preference whether 'x' or 'y' is included. I'll update the question. — Ina, Mar 30 '12 at 14:34
For `data.table` alternatives: [Filtering out duplicated/non-unique rows in data.table](https://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table) — Henrik, Jul 10 '18 at 15:21

score 146 · Accepted Answer · answered Mar 30 '12 at 14:38

146

Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:

dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
  id id2 somevalue
1  1   1         x
3  3   4         z

Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)

answered Mar 30 '12 at 14:38

joran

169,992
32
429
468

@Ferroao according to this website, it's not possible in SQL http://www.dofactory.com/sql/select-distinct – 3pitt Oct 18 '17 at 19:22
what if the value in the non-duplicated column matters, and we want to keep all the possible combinations? https://stackoverflow.com/questions/75300005/extract-all-possible-combinations-of-rows-with-unique-values-in-a-variable – SteveMcManaman Jan 31 '23 at 16:36

score 41 · Answer 2 · answered Jul 17 '18 at 18:37

41

Here are a couple dplyr options that keep non-duplicate rows based on columns id and id2:

library(dplyr)                                        
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)

answered Jul 17 '18 at 18:37

sbha

9,802
2
74
62

4

I guess no reason to consider using the alternatives to `distinct` – Frank Jul 17 '18 at 18:40
1

@Frank maybe if there is a date or some other sequential field a combination of the two other options with some slight tweaks could be used to ensure the most recent observation is kept `df %>% group_by(id, id2) %>% filter(date == max(date)) %>% slice(1)` without having to sort the data. `filter()` gets the most recent date, `slice()` ensures only one observation is returned if there are ties. In other cases something like `df %>% group_by(id, id2) %>% slice(max(row_number()))` could might give some more flexibility. You could always use `arrange()` before `distinct()` too. – sbha Feb 09 '19 at 12:32
@sbha Is there a method to designate a preference for a row with a certain column value when there is a tie in the column you are grouping on? In the case of the example in the question, the row with `somevalue == x` is always returned when the row is a duplicate in the `id` and `id2` columns. – Lorcán May 20 '19 at 10:57
what if the value in the non-duplicated column matters, and we want to keep all the possible combinations? https://stackoverflow.com/questions/75300005/extract-all-possible-combinations-of-rows-with-unique-values-in-a-variable – SteveMcManaman Jan 31 '23 at 16:36

score 15 · Answer 3 · answered Oct 22 '15 at 18:35

15

Using unique():

dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))    
dat[row.names(unique(dat[,c("id", "id2")])),]

answered Oct 22 '15 at 18:35

Gary Feng

420
5
8

`unique(dat[,c("id", "id2")])` returns vector, not dataframe, so you cannot refer to its `row.names` – Sashko Lykhenko Apr 18 '19 at 07:59
@SashkoLykhenko, did you miss the last comma? – Gary Feng Apr 27 '19 at 23:38
I copypasted this `row.names(unique(dat[,c("id", "id2")]))` and it returned error. Last comma relates to outer `dat[...,]` – Sashko Lykhenko Apr 29 '19 at 08:35

score 1 · Answer 4 · edited Jul 10 '18 at 17:59

1

Minor update in @Joran's code.
Using the code below, you can avoid the ambiguity and only get the unique of two columns:

dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))    
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]

edited Jul 10 '18 at 17:59

taras

6,566
10
39
50

answered Jul 10 '18 at 15:13

Vaya Ashish

11
2

3

This looks identical to Gary Feng's answer, except that you don't include the `somevalue` result. – Gregor Thomas Jul 10 '18 at 15:25

Unique on a dataframe with only selected columns

4 Answers4

Linked

Related