118

I tried using the code presented here to find ALL duplicated elements with dplyr like this:

library(dplyr)

mtcars %>%
  mutate(cyl.dup = cyl[duplicated(cyl) | duplicated(cyl, fromLast = TRUE)])

How can I convert the code presented here to find ALL duplicated elements with dplyr? My code above just throws an error. Or even better, is there another function that will achieve this more succinctly than the convoluted x[duplicated(x) | duplicated(x, fromLast = TRUE)] approach?
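For reference, the base-R idiom referred to above behaves like this on a small stand-alone vector (a minimal sketch, not taken from the linked answer):

x <- c(1, 2, 2, 3, 3, 3, 4)

# keep every copy of the values that occur more than once
x[duplicated(x) | duplicated(x, fromLast = TRUE)]
#> [1] 2 2 3 3 3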

AndrewGB
luciano

8 Answers

212

I guess you could use filter for this purpose:

mtcars %>% 
  group_by(carb) %>% 
  filter(n()>1)

Small example (note that I added summarize() to prove that the resulting data set indeed contains no rows with a unique 'carb'; I used 'carb' instead of 'cyl' because 'carb' has unique values whereas 'cyl' does not):

mtcars %>% group_by(carb) %>% summarize(n=n())
#Source: local data frame [6 x 2]
#
#  carb  n
#1    1  7
#2    2 10
#3    3  3
#4    4 10
#5    6  1
#6    8  1

mtcars %>% group_by(carb) %>% filter(n()>1) %>% summarize(n=n())
#Source: local data frame [4 x 2]
#
#  carb  n
#1    1  7
#2    2 10
#3    3  3
#4    4 10
Marat Talipov
  • I also mention the `add_count` function that adds the `n` column on which we can then filter: `mtcars %>% add_count(carb) %>% filter(n>1)` gives the same result with the additional `n` column. – Pascal Martin Feb 09 '19 at 11:49
  • Note that this results in a grouped dataframe; subsequent use of verbs like 'distinct' will have different results than they would on an ungrouped dataframe; to convert back to an ungrouped data containing only the duplicated records, use ungroup() (https://dplyr.tidyverse.org/articles/grouping.html) – adam.r May 10 '22 at 00:25
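Putting the two comments above together, a minimal sketch of both variants (assuming a reasonably recent dplyr that provides add_count()):

library(dplyr)

# same filter as above, followed by ungroup() so later verbs see an ungrouped data frame
mtcars %>%
  group_by(carb) %>%
  filter(n() > 1) %>%
  ungroup()

# or via add_count(), which attaches the n column without grouping; select(-n) drops it again
mtcars %>%
  add_count(carb) %>%
  filter(n > 1) %>%
  select(-n)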
56

Another solution is to use the janitor package:

library(janitor)

mtcars %>% get_dupes(wt)
radek
  • Note that the result of this has all instances of duplicates in the data. For example, if you ran `get_dupes(data.frame(x=c(1,1,1)))` you would get the whole df back. `get_dupes` also adds a column of counts. See [here](http://sfirke.github.io/janitor/reference/get_dupes.html). – Levi Baguley May 30 '20 at 16:13
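A minimal sketch of the other call styles get_dupes() supports, assuming the janitor package is installed; it also appends a count column (dupe_count in current janitor versions), as noted in the comment:

library(dplyr)
library(janitor)

# duplicates judged on a combination of columns
mtcars %>% get_dupes(cyl, gear, carb)

# with no columns named, entire rows are compared
mtcars %>% get_dupes()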
23

We can find duplicated elements with dplyr as follows.

library(dplyr)

# Only the duplicated elements (first occurrences excluded)
mtcars %>%
  filter(duplicated(.[["carb"]]))

# All duplicated elements
mtcars %>%
  filter(carb %in% unique(.[["carb"]][duplicated(.[["carb"]])]))
Keiku
  • also the following syntax for multiple columns: `mtcars %>% filter(duplicated(.[c("carb", "cyl")]))` – RDavey Dec 17 '21 at 12:34
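Extending that comment so that all copies are kept (first occurrences included), a sketch in the same .-subsetting style; note that the . pronoun relies on the magrittr pipe %>%:

library(dplyr)

# rows whose (carb, cyl) combination appears more than once
mtcars %>%
  filter(duplicated(.[c("carb", "cyl")]) |
           duplicated(.[c("carb", "cyl")], fromLast = TRUE))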
22

The original post contains an error in how it applies the solution from the related answer: used inside mutate, the expression subsets the cyl vector, so the result is no longer the same length as the mtcars data frame, which is what triggers the error.

Instead, you can use filter to return all duplicated elements, or mutate with ifelse to create a dummy variable that can be filtered on later:

library(dplyr)

# Return all duplicated elements
mtcars %>%
  filter(duplicated(cyl) | duplicated(cyl, fromLast = TRUE))

# Or make a dummy variable flagging all duplicated elements
mtcars %>%
  mutate(cyl.dup = ifelse(duplicated(cyl) | duplicated(cyl, fromLast = TRUE), 1, 0))
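As a small variation on the answer above (not part of the original), the ifelse() wrapper can be dropped if a logical flag is enough:

mtcars %>%
  mutate(cyl.dup = duplicated(cyl) | duplicated(cyl, fromLast = TRUE))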
Isaac Freitas
6

# Adding a shortcut to the answer above
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mtcars %>% count(carb)
#> # A tibble: 6 x 2
#>    carb     n
#>   <dbl> <int>
#> 1    1.     7
#> 2    2.    10
#> 3    3.     3
#> 4    4.    10
#> 5    6.     1
#> 6    8.     1
mtcars %>% count(carb) %>% filter(n > 1)
#> # A tibble: 4 x 2
#>    carb     n
#>   <dbl> <int>
#> 1    1.     7
#> 2    2.    10
#> 3    3.     3
#> 4    4.    10

# Showing an alternative that follows the apparent intention of the asker
duplicated_carb <- mtcars %>% 
  mutate(dup_carb = duplicated(carb)) %>% 
  filter(dup_carb)
duplicated_carb
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb dup_carb
#> 1  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     TRUE
#> 2  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     TRUE
#> 3  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     TRUE
#> 4  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     TRUE
#> 5  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     TRUE
#> 6  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     TRUE
#> 7  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     TRUE
#> 8  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     TRUE
#> 9  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     TRUE
#> 10 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     TRUE
#> 11 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     TRUE
#> 12 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     TRUE
#> 13 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     TRUE
#> 14 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1     TRUE
#> 15 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     TRUE
#> 16 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     TRUE
#> 17 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     TRUE
#> 18 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2     TRUE
#> 19 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2     TRUE
#> 20 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     TRUE
#> 21 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2     TRUE
#> 22 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1     TRUE
#> 23 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     TRUE
#> 24 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     TRUE
#> 25 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4     TRUE
#> 26 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     TRUE
Mauro Lepore
  • Re the "alternative", Marat's answer has 30 rows while yours has 26. I guess you can instead add one step to your original pipeline: `mtcars %>% count(carb) %>% filter(n > 1) %>% left_join(mtcars)` maybe with select(-n) first. – Frank Mar 21 '18 at 20:06
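Frank's suggestion from the comment, spelled out as a runnable sketch; it returns the 30 rows whose carb value is duplicated, with carb as the first column:

library(dplyr)

# keep only carb values that occur more than once, then join the full rows back
mtcars %>%
  count(carb) %>%
  filter(n > 1) %>%
  select(-n) %>%
  left_join(mtcars, by = "carb")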
0

You can create a Boolean mask with duplicated():

iris %>% duplicated()
  [1] FALSE FALSE FALSE .... TRUE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE

And pass through square brackets indexing:

iris[iris %>% duplicated(),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
143          5.8         2.7          5.1         1.9 virginica

Note: this approach is the closest equivalent to the Pandas idiom that can be done with R and dplyr:

iris[iris %>% duplicated(), c("Petal.Length","Petal.Width","Species")]
    Petal.Length Petal.Width   Species
143          5.1         1.9 virginica
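The same row selection can also stay inside a dplyr pipeline instead of base bracket indexing (a minor rephrasing of the answer above, again relying on the magrittr . pronoun):

library(dplyr)

# all rows that are exact duplicates of an earlier row
iris %>% filter(duplicated(.))

# and with a column selection, mirroring the Pandas-style example
iris %>% filter(duplicated(.)) %>% select(Petal.Length, Petal.Width, Species)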
rubengavidia0x
0

A more general solution if you want to group duplicates using many columns:

df %>%
  select(ID, COL1, COL2, all_of(vector_of_columns)) %>%
  distinct() %>%
  ungroup() %>%
  rowwise() %>%
  mutate(ID_GROUPS = paste0(ID, "_", cur_group_rows())) %>%
  ungroup() %>%
  full_join(., df, by = c("ID", "COL1", "COL2", vector_of_columns)) -> chk
understorey
-1

Find duplicate values in a data frame by a given column:

df <- dataset[duplicated(dataset$columnname), ]
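Note that duplicated() alone flags only the second and later occurrences; adding a fromLast pass, as in the idiom from the question, keeps every copy (dataset and columnname are placeholders from the answer above):

# second and later occurrences only
df <- dataset[duplicated(dataset$columnname), ]

# every row whose value occurs more than once
df_all <- dataset[duplicated(dataset$columnname) |
                    duplicated(dataset$columnname, fromLast = TRUE), ]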
Henry Ecker
Rupesh Kumar