
I'd like to subset a dataframe to include only rows that have unique combinations of three columns. My situation is similar to the one presented in this question, but I'd like to preserve the other columns in my data as well. Here's my example:

> df
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62
6  9   3   C   75  75

The requested output would be something like this, where I'm looking for unique cases based on v1, v2, and v3 only:

> df.new
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
6  9   3   C   75  75

If I could recover the non-unique rows that would be great too:

> df.dupes
  v1  v2  v3   v4  v5
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62

I saw a related question about how to do this in SQL (here), but I can't manage it in R. I'm sure it's simple, but messing with unique() and subset() hasn't been fruitful.

bosbmgatl

7 Answers


You can use the duplicated() function to find the unique combinations:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, you can check it in both directions:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
Ken Williams
  • First part works, but for the second part, will that only recover two duplicates? (I edited the example to demonstrate that there can be more than two duplicates.) – bosbmgatl Jul 06 '12 at 21:50
  • @user1202761 You should try Ken's code again. It returns all three duplicated rows for me. – joran Jul 08 '12 at 00:20
  • @joran you are correct! Turns out after some examination that my data didn't have more than two duplicates, and that's why I couldn't get three to show! Thanks for the heads up. – bosbmgatl Jul 08 '12 at 01:21
  • I'd rather not ask a new question about this, so: how can I do the same based on columns 1 and 3, but NOT column 2? I tried 1,3 instead of 1:3 but it doesn't work. – Isak Sep 22 '15 at 16:09
  • @user3343907 That would be `c(1,3)`. – Ken Williams Sep 22 '15 at 16:18
  • Thanks, but I tried and failed! Where exactly should the c be in this: df[!duplicated (df[1,3]),] – Isak Sep 22 '15 at 16:23
  • @user3343907 Like so: `df[!duplicated( df[c(1,3)] ), ]` – Ken Williams Sep 22 '15 at 16:28

Using dplyr you could do:

library(dplyr)

# distinct
df %>% 
  distinct(v1, v2, v3, .keep_all = TRUE)

# non-distinct only
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1)

# exclude any non-distinct
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() == 1)
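
Note that the two group_by() pipelines return grouped tibbles; if you want a plain ungrouped result, add ungroup() at the end, e.g.:

df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1) %>% 
  ungroup()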
sbha
  • I'll add that `distinct` will keep the first row of a duplicate set and discard the rest. There is no averaging or imputing going on. – jiggunjer Sep 10 '19 at 04:00

You can use the plyr package:

library(plyr)

ddply(df, c("v1","v2","v3"), head, 1)
#   v1 v2 v3  v4 v5
# 1  7  1  A 100 98
# 2  7  2  A  98 97
# 3  8  1  C  NA 80
# 4  9  3  C  75 75

ddply(df, c("v1","v2","v3"), function(x) if(nrow(x)>1) x else NULL)
#   v1 v2 v3 v4 v5
# 1  8  1  C NA 80
# 2  8  1  C 78 75
# 3  8  1  C 50 62
flodel

Using plyr and ddply gets very slow when you have a lot of data.

Try something along these lines instead, which works with the duplicated() flags directly:

df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast = TRUE), ]

or:

from <- which(duplicated(df[1:3]))                  # later occurrences of each combination
to   <- which(duplicated(df[1:3], fromLast = TRUE)) # earlier occurrences of each combination
df[sort(union(from, to)), ]                         # all rows in a duplicated set, in original order

This should be faster for the most part; test it out on your own data. unique() is also worth a look.
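
If you want to check the speed claim yourself, here is a rough benchmark sketch (assuming plyr is installed; the data size, number of groups, and column values are arbitrary):

set.seed(1)
n <- 1e5
big <- data.frame(v1 = sample(20, n, replace = TRUE),
                  v2 = sample(20, n, replace = TRUE),
                  v3 = sample(LETTERS, n, replace = TRUE),
                  v4 = rnorm(n))

system.time(big[!duplicated(big[1:3]), ])                    # base R duplicated()
system.time(plyr::ddply(big, c("v1", "v2", "v3"), head, 1))  # plyr ddply()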

  • `dplyr` has basically replaced plyr and ddply. Now the syntax is clean and logical, AND it runs faster than base R. – leerssej Mar 29 '17 at 17:34

I know it's a very old question, but I thought the obvious solution using the unique() function should also be presented here:

unique(df[1:3])

or, specifying columns by name:

unique(df[c("v1", "v2", "v3")])

...or using row/column indexing:

unique(df[, c("v1", "v2", "v3")])

Note that these return only the three key columns; to keep the other columns as well, use the duplicated() approach above.

Provided you're working with data.table, it's easy to get unique cases with respect to one or more variables. The syntax would be the following:

unique(DT, by = c("var1", "var2"))
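
To also recover the duplicated rows, you can flag them in both directions, analogous to the duplicated() answer above. A sketch, assuming the question's df has been converted to a data.table:

library(data.table)
DT <- as.data.table(df)

# unique combinations of v1, v2, v3, keeping all columns
unique(DT, by = c("v1", "v2", "v3"))

# all rows belonging to a duplicated combination
dupes <- duplicated(DT, by = c("v1", "v2", "v3")) |
  duplicated(DT, by = c("v1", "v2", "v3"), fromLast = TRUE)
DT[dupes]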
Gilbert M.

A non-elegant but functional way is to paste the key columns of each row together and find which combinations are unique (or non-unique), something like:

df.vector <- apply(df[1:3], 1, paste, collapse = "_") # one key string per row; the separator avoids ambiguous concatenations
df.table  <- table(df.vector)                         # count occurrences of each combination

then get the indexes of the duplicates with something like:

which(df.vector %in% names(which(df.table > 1)))
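
To mirror the df.new / df.dupes output from the question, you can then subset with those indexes (a short completion of the code above; dupe.idx is just an illustrative name):

dupe.idx <- which(df.vector %in% names(which(df.table > 1)))
df.dupes <- df[dupe.idx, ]               # every row in a duplicated set
df.new   <- df[!duplicated(df.vector), ] # first row of each combination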