R equivalent of SELECT DISTINCT on two or more fields/variables

Question

Say I have a data frame df with two or more columns. Is there an easy way to use unique() or another R function to create a subset of the unique combinations of two or more columns?

I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.

I tried coercing an ftable to a data frame and using the field names, but the result also includes cross-tabulations of combinations that don't exist in the dataset:

uniques <- as.data.frame(ftable(df$var1, df$var2))

Marek · Accepted Answer · 2021-01-07T11:11:48.277

62

unique() works on data frames, so unique(df[c("var1","var2")]) should be what you want.
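As a minimal sketch on made-up data (the data frame below and its values are assumptions for illustration, not from the question), this shows unique() returning only the combinations that occur, whereas a cross-tabulation lists every pairing of levels. Note that table() stands in for ftable() here:

```r
# Hypothetical example data; column names follow the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Distinct combinations of var1 and var2 that actually occur in df.
combos <- unique(df[c("var1", "var2")])

# A cross-tabulation has one row per pairing of levels, including
# pairings that never occur together (those rows have Freq == 0).
crosstab <- as.data.frame(table(df$var1, df$var2))

print(combos)
print(nrow(crosstab))
```

The result keeps the original row names of the first occurrence of each combination; reset them with rownames(combos) <- NULL if that matters.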

Another option is distinct() from the dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, 2016-06-24), distinct() required an additional step:

df %>% select(var1, var2) %>% distinct

(or, in the older nested style, distinct(select(df, var1, var2))).

edited Jan 07 '21 at 11:11

answered May 24 '10 at 22:25


tjebo · Answer 2 · 2020-01-30T11:43:50.070

27

@Marek's answer is obviously correct, but may be outdated. The current dplyr version (0.7.4) allows for even simpler code:

Simply use:

df %>% distinct(var1, var2)

If you want to keep all columns, add .keep_all = TRUE:

df %>% distinct(var1, var2, .keep_all = TRUE)
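To make the difference between the two calls concrete, here is a small sketch on made-up data (the data frame and its value column are assumptions for illustration). It assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical example data; column names follow the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Only the two named columns, one row per distinct combination.
combos <- df %>% distinct(var1, var2)

# All columns, keeping the first row seen for each combination.
first_rows <- df %>% distinct(var1, var2, .keep_all = TRUE)

print(combos)
print(first_rows)
```

With .keep_all = TRUE the other columns come from the first row of each combination, so the row order of df determines which values survive.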

edited Jan 30 '20 at 11:43

answered Mar 01 '18 at 14:33


sbaniwal · Answer 3 · 2017-07-23T19:01:01.573

5

To keep all other variables in df, use this:

unique_rows <- !duplicated(df[c("var1","var2")])

unique.df <- df[unique_rows,]

Another, less recommended, method uses row.names() (see David's comment below):

unique_rows <- row.names(unique(df[c("var1","var2")]))

unique.df <- df[unique_rows,]
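A short sketch comparing the two approaches on made-up data (the data frame is an assumption for illustration). On a plain data frame with default row names both select the same rows, but the duplicated() version does not depend on row names at all:

```r
# Hypothetical example data; column names follow the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Logical flag: TRUE for the first occurrence of each (var1, var2) pair.
keep <- !duplicated(df[c("var1", "var2")])
by_flag <- df[keep, ]

# Row-name route: look rows up again by their names.
by_name <- df[row.names(unique(df[c("var1", "var2")])), ]

print(keep)
print(by_flag)
print(by_name)
```

Both keep the first row of each combination, so which values of the other columns survive depends on the row order of df.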

edited Jul 23 '17 at 19:01

answered Jul 20 '17 at 19:15


No. Operating over row names is always a bad idea. Just use `duplicated` if you want a boolean vector. – David Arenburg Jul 20 '17 at 20:42
Because you've edited your answer without adding any note/contribution. So no one knew you actually fixed your answer. – David Arenburg Jul 23 '17 at 08:27

score 2 · Answer 4 · answered Apr 19 '20 at 08:39

2

In addition to the answers above, here is the data.table version:

library(data.table)

setDT(df)  # converts df to a data.table by reference

unique_dt <- unique(df, by = c('var1', 'var2'))
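As a sketch on made-up data (the data frame is an assumption for illustration, and data.table must be installed): unique() with by keeps every column and the first row of each combination. To get only the two columns, as SELECT DISTINCT var1, var2 would, select them first. This sketch uses as.data.table() instead of setDT() so the original df is left unchanged:

```r
library(data.table)

# Hypothetical example data; column names follow the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Copy into a data.table rather than converting df in place.
dt <- as.data.table(df)

# All columns, first row per (var1, var2) combination.
unique_dt <- unique(dt, by = c("var1", "var2"))

# Only the two columns of interest.
combos_dt <- unique(dt[, .(var1, var2)])

print(unique_dt)
print(combos_dt)
```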

answered Apr 19 '20 at 08:39

Zaki
