Select unique values with 'select' function in 'dplyr' library

Question

Is it possible to select all unique values from a column of a data.frame using the select function in the dplyr library? Something like "SELECT DISTINCT field1 FROM table1" in SQL notation.

Thanks!

Ron Gejman · Accepted Answer · 2015-12-17T13:28:51.050

106

Since dplyr 0.3, this can be done with the distinct() function.

Here is an example:

distinct_df = df %>% distinct(field1)

You can get a vector of the distinct values with:

distinct_vector = distinct_df$field1

You can also select a subset of columns in the same pipeline as the distinct() call, which makes the result cleaner to inspect with head(), tail(), or glimpse():

distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
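For a self-contained version of the snippets above, here is a minimal sketch. The data frame `df` and its contents are made up for illustration (the question does not show any data); only `field1` comes from the question.

```r
library(dplyr)

# Hypothetical data standing in for the question's table1
df <- data.frame(
  field1 = c("a", "b", "a", "c", "b"),
  field2 = 1:5,
  stringsAsFactors = FALSE
)

# One row per distinct value of field1; select() drops any other columns
# explicitly, so the result has the same shape across dplyr versions
distinct_df <- df %>% distinct(field1) %>% select(field1)

# Plain vector of the distinct values, in order of first appearance
distinct_vector <- distinct_df$field1
```

This is roughly the dplyr counterpart of `SELECT DISTINCT field1 FROM table1`.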

edited Dec 17 '15 at 13:28

answered Oct 22 '14 at 22:54


3

This works if the data frame is already in R, but it doesn't work if you're trying to do the query directly on the database via a db connection (i.e. `src_postgres()`). It reports: `Error: Can't calculate distinct only on specified columns with SQL` – djhocking Jan 15 '15 at 16:28
See this question for how to connect the src_postgres() and dplyr http://stackoverflow.com/questions/21592266/i-cannot-connect-postgresql-schema-table-with-dplyr-package – Ron Gejman Mar 08 '15 at 14:43
20

Note that the way `distinct()` works has changed in dplyr 0.5. By default `distinct()` now only returns the columns that are used as arguments to `distinct()`. If you want to retain the other columns you now have to pass `.keep_all = TRUE` as an additional argument to `distinct()` – RoyalTS Jul 30 '16 at 15:37
2

Yep, dplyr 0.5 broke my code previously written using 0.3 and distinct. Why the change? The previous default behavior was useful and the natural way to do it. – user1905004 Oct 01 '16 at 00:52
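To illustrate the dplyr 0.5 change described in the comments, here is a small sketch. It assumes dplyr 0.5 or later, and the data frame is hypothetical.

```r
library(dplyr)

# Hypothetical data: field1 has a repeated value, field2 differs on every row
df <- data.frame(
  field1 = c("a", "b", "a"),
  field2 = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Default since dplyr 0.5: only the columns named in distinct() are returned
only_key <- df %>% distinct(field1)

# .keep_all = TRUE retains the other columns, taking the first row
# found for each distinct value of field1
with_rest <- df %>% distinct(field1, .keep_all = TRUE)
```

Code written for dplyr 0.3 that relied on the other columns surviving would need `.keep_all = TRUE` added to behave as before.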

score 25 · Answer 2 · edited Aug 18 '21 at 17:09


Just to add to the other answers, if you would prefer to return a vector rather than a dataframe, you have the following options:

dplyr >= 0.7.0

Use the pull verb:

mtcars %>% distinct(cyl) %>% pull()

dplyr < 0.7.0

Enclose the dplyr pipeline in parentheses and combine it with the $ operator:

(mtcars %>% distinct(cyl))$cyl
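As a small variation on the above, here is a sketch that names the column in pull() explicitly (with no argument, pull() takes the last column) and compares the result with base R's unique(). It assumes dplyr 0.7.0 or later.

```r
library(dplyr)

# Name the column explicitly rather than relying on pull()'s default
cyl_values <- mtcars %>% distinct(cyl) %>% pull(cyl)

# Base R equivalent, for comparison
base_values <- unique(mtcars$cyl)
```

Both approaches return the distinct values in order of first appearance in the data.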

edited Aug 18 '21 at 17:09

mhovd


answered Oct 20 '16 at 10:57

Josh Gilfillan


I love that you identified `pull` as a verb -- more poetic (and descriptive) than "function"! – butterflyeffect Sep 21 '22 at 23:22

eipi10 · Answer 3 · 2014-08-29T16:18:29.250

The dplyr select function selects specific columns from a data frame. To return unique values in a particular column of data, you can use the group_by function. For example:

library(dplyr)

# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))

# Return the distinct values of x
dat %>%
  group_by(x) %>%
  summarise() 

    x
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10

If you want to change the column name you can add the following:

dat %>%
  group_by(x) %>%
  summarise() %>%
  select(unique.x=x)

This both selects column x from among all the columns in the data frame that dplyr returns (and of course there's only one column in this case) and changes its name to unique.x.

You can also get the unique values directly in base R with unique(dat$x).

If you have multiple variables and want all unique combinations that appear in the data, you can generalize the above code as follows:

set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE), 
                 y=sample(letters[1:5], 100, replace=TRUE))

dat %>% 
  group_by(x,y) %>%
  summarise() %>%
  select(unique.x=x, unique.y=y)
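For comparison, the same unique combinations can be obtained with distinct(), available since dplyr 0.3 as noted in the accepted answer. The sketch below reuses the fake data from above; the object names `by_group` and `by_distinct` are only for illustration.

```r
library(dplyr)

set.seed(5)
dat <- data.frame(x = sample(1:10, 100, replace = TRUE),
                  y = sample(letters[1:5], 100, replace = TRUE),
                  stringsAsFactors = FALSE)

# Unique (x, y) combinations via group_by() + summarise(), as above;
# ungroup() drops the grouping that summarise() leaves behind
by_group <- dat %>%
  group_by(x, y) %>%
  summarise() %>%
  ungroup()

# The same combinations via distinct(); rows come back in order of first
# appearance, so sort them before comparing with the grouped result
by_distinct <- dat %>%
  distinct(x, y) %>%
  arrange(x, y)
```

The two results contain the same rows; distinct() avoids the grouping step and keeps the input a plain data frame.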

Or use the new `distinct()` function in dplyr 0.3 – hadley Sep 01 '14 at 15:04
