
I'd like to subset a dataframe to include only rows that have unique combinations of three columns. My situation is similar to the one presented in this question, but I'd like to preserve the other columns in my data as well. Here's my example:

> df
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62
6  9   3   C   75  75

The requested output would be something like this, where I'm looking for unique cases based on v1, v2, and v3 only:

> df.new
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
6  9   3   C   75  75

If I could recover the non-unique rows that would be great too:

> df.dupes
  v1  v2  v3   v4  v5
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62

I saw a related question about how to do this in SQL (here), but I can't manage it in R. I'm sure it's simple, but messing with unique() and subset() hasn't been fruitful.

bosbmgatl

7 Answers


You can use the duplicated() function to find the unique combinations:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, you can check it in both directions:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
Ken Williams
  • First part works, but for the second part, will that only recover two duplicates? (I edited the example to demonstrate that there can be more than two duplicates.) – bosbmgatl Jul 06 '12 at 21:50
  • @user1202761 You should try Ken's code again. It returns all three duplicated rows for me. – joran Jul 08 '12 at 00:20
  • @joran you are correct! Turns out after some examination that my data didn't have more than two duplicates, and that's why I couldn't get three to show! Thanks for the heads up. – bosbmgatl Jul 08 '12 at 01:21
  • I'd rather not ask a new question about this, so: how can I do the same based on columns 1 and 3, but NOT column 2? I tried 1,3 instead of 1:3 but it doesn't work. – Isak Sep 22 '15 at 16:09
  • @user3343907 That would be `c(1,3)`. – Ken Williams Sep 22 '15 at 16:18
  • Thanks, but I tried and failed! Where exactly should the c be in this: df[!duplicated (df[1,3]),] – Isak Sep 22 '15 at 16:23
  • @user3343907 Like so: `df[!duplicated( df[c(1,3)] ), ]` – Ken Williams Sep 22 '15 at 16:28

Using dplyr you could do:

library(dplyr)

# distinct
df %>% 
  distinct(v1, v2, v3, .keep_all = TRUE)

# non-distinct only
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1)

# exclude any non-distinct
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() == 1)
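
Note that the two group_by() pipelines return grouped tibbles; if you want a plain ungrouped result, add ungroup() at the end, e.g.:

df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1) %>% 
  ungroup()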
sbha
  • I'll add that `distinct` will keep the first row of a duplicate set and discard the rest. There is no averaging or imputing going on. – jiggunjer Sep 10 '19 at 04:00

You can use the plyr package:

library(plyr)

ddply(df, c("v1","v2","v3"), head, 1)
#   v1 v2 v3  v4 v5
# 1  7  1  A 100 98
# 2  7  2  A  98 97
# 3  8  1  C  NA 80
# 4  9  3  C  75 75

ddply(df, c("v1","v2","v3"), function(x) if(nrow(x)>1) x else NULL)
#   v1 v2 v3 v4 v5
# 1  8  1  C NA 80
# 2  8  1  C 78 75
# 3  8  1  C 50 62
flodel

Using plyr and ddply gets very slow when you have a lot of data.

Try something along these lines instead, which works with the duplicated() flags directly:

df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast = TRUE), ]

or:

from <- which(duplicated(df[1:3]))                  # later occurrences of each combination
to   <- which(duplicated(df[1:3], fromLast = TRUE)) # earlier occurrences of each combination
df[sort(union(from, to)), ]                         # all rows in a duplicated set, in original order

This should be faster for the most part; test it out on your own data. unique() is also worth a look.
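
If you want to check the speed claim yourself, here is a rough benchmark sketch (assuming plyr is installed; the data size, number of groups, and column values are arbitrary):

set.seed(1)
n <- 1e5
big <- data.frame(v1 = sample(20, n, replace = TRUE),
                  v2 = sample(20, n, replace = TRUE),
                  v3 = sample(LETTERS, n, replace = TRUE),
                  v4 = rnorm(n))

system.time(big[!duplicated(big[1:3]), ])                    # base R duplicated()
system.time(plyr::ddply(big, c("v1", "v2", "v3"), head, 1))  # plyr ddply()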

  • `dplyr` has basically replaced plyr and ddply. Now the syntax is clean and logical, AND it runs faster than base R. – leerssej Mar 29 '17 at 17:34

I know it's a very old question, but I thought the obvious solution using the unique() function should also be presented here:

unique(df[1:3])

or, specifying columns by name:

unique(df[c("v1", "v2", "v3")])

...or using row/column indexing:

unique(df[, c("v1", "v2", "v3")])

Note that these return only the three key columns; to keep the other columns as well, use the duplicated() approach above.

Provided you're working with data.table, it's easy to get unique cases with respect to one or more variables. The syntax would be the following:

unique(DT, by = c("var1", "var2"))
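
To also recover the duplicated rows, you can flag them in both directions, analogous to the duplicated() answer above. A sketch, assuming the question's df has been converted to a data.table:

library(data.table)
DT <- as.data.table(df)

# unique combinations of v1, v2, v3, keeping all columns
unique(DT, by = c("v1", "v2", "v3"))

# all rows belonging to a duplicated combination
dupes <- duplicated(DT, by = c("v1", "v2", "v3")) |
  duplicated(DT, by = c("v1", "v2", "v3"), fromLast = TRUE)
DT[dupes]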
Gilbert M.

A non-elegant but functional way is to paste the key columns of each row together and find which combinations are unique (or non-unique), something like:

df.vector <- apply(df[1:3], 1, paste, collapse = "_") # one key string per row; the separator avoids ambiguous concatenations
df.table  <- table(df.vector)                         # count occurrences of each combination

then get the indexes of the duplicates with something like:

which(df.vector %in% names(which(df.table > 1)))
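
To mirror the df.new / df.dupes output from the question, you can then subset with those indexes (a short completion of the code above; dupe.idx is just an illustrative name):

dupe.idx <- which(df.vector %in% names(which(df.table > 1)))
df.dupes <- df[dupe.idx, ]               # every row in a duplicated set
df.new   <- df[!duplicated(df.vector), ] # first row of each combination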