160

R's duplicated returns a vector showing whether each element of a vector, or each row of a data frame, is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are identical, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.

Lauren Samuels (edited by smci)

9 Answers

185

duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it: just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.


Some late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c","c","c") 
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c
Joshua Ulrich (edited by hanna)
  • Hold on, I just ran a test and found I was wrong: `x <- c(1:9, 7:10, 5:22); y <- c(letters, letters[1:5]); test <- data.frame(x, y); test[duplicated(test$x) | duplicated(test$x, fromLast=TRUE), ]` returned all three of the copies of 7, 8, and 9. Why does that work? – JoeM05 Apr 09 '17 at 21:21
  • Because the middle ones are captured no matter whether you start from the end or from the front. For example, `duplicated(c(1,1,1))` vs `duplicated(c(1,1,1), fromLast = TRUE)` gives `c(FALSE,TRUE,TRUE)` and `c(TRUE,TRUE,FALSE)`. The middle value is `TRUE` in both cases, and taking `|` of both vectors gives `c(TRUE,TRUE,TRUE)`. – Brandon Mar 11 '18 at 02:36
42

You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
>  vec %in% unique(vec[ duplicated(vec)]) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
IRTFM
  • Agree. Might even slow down processing but unlikely to slow it down very much. – IRTFM Jun 21 '18 at 14:42
  • Quite true. The OP did not offer a data example to test for "ever duplicated" rows in a dataframe. I think my suggestion of using `duplicated`, `unique` and `%in%` could easily be generalized to a dataframe if one were to first `paste` each row with an unusual separator character, as sketched below. (The accepted answer is better.) – IRTFM Jun 06 '19 at 22:09
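
A minimal sketch of that paste-based generalization, assuming a data frame df whose columns paste cleanly (the separator just needs to be a string that cannot occur in the data):

# collapse each row into a single key string, then reuse the vector approach
key <- do.call(paste, c(df, sep = "\r"))
df[key %in% unique(key[duplicated(key)]), ]
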
23

Duplicated rows in a dataframe can be obtained with dplyr by doing

library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()

To exclude certain columns, group_by_at(vars(-var1, -var2)) could be used instead to group the data.
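
For instance, with the iris-based test data built above, to look for duplicate rows while ignoring the Species column:

df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()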

If the row indices, and not just the data, are actually needed, you could add them first as in:

df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
Holger Brandl (edited by IRTFM)
  • Nice use of `n()`. Don't forget to ungroup the resulting dataframe. – qwr Jul 02 '19 at 21:35
  • @qwr I've adjusted the answer to ungroup the result – Holger Brandl Jul 03 '19 at 07:21
  • @HolgerBrandl, @qwr, the general answer is useful, but I don't understand how to pick column(s) to exclude. What does `vars` refer to in `group_by_at(vars(-var1, -var2))`? Are `var1` and `var2` column names in the data frame? I assume the negative signs signify exclusion, right? So the rest of the process (`filter` and `ungroup`) acts on all the remaining columns, not including `var1` and `var2`, is that right? Sorry to be so pedantic, but I often have problems with quick shorthand! – W Barker Jul 08 '21 at 12:19
  • `vars` is a helper function in dplyr; see https://dplyr.tidyverse.org/reference/vars.html. `var1` and `var2` indeed refer to column names to be excluded from the duplication check. Duplication is assessed on the grouping variables in the suggested solution. Indeed, negative signifies exclusion. – Holger Brandl Jul 08 '21 at 20:23
  • `group_by_all()` and `group_by_at()` have been superseded in recent versions of dplyr. Now you can do this: `iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()` – MCornejo Sep 29 '21 at 20:50
4

Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():

allDuplicated <- function(vec){
  # flag duplicates scanning from the front ...
  front <- duplicated(vec)
  # ... and from the back, so first occurrences are caught too
  back <- duplicated(vec, fromLast = TRUE)
  # an element is "ever duplicated" if either pass flags it
  front | back
}

Using the same example:

vec <- c("a", "b", "c","c","c") 
allDuplicated(vec) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
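
Because duplicated() also accepts data frames, the same helper flags rows; for example, with the df from the accepted answer:

df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
allDuplicated(df)
[1] FALSE FALSE  TRUE  TRUE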

canderson156
3

I've had the same question, and if I'm not mistaken, this is also an answer.

vec[col %in% vec[duplicated(vec$col),]$col]

Dunno which one is faster, though; the dataset I'm currently using isn't big enough for tests that produce significant time gaps.
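
As the comment below notes, this snippet mixes vector and data-frame indexing. A corrected sketch, assuming a data frame df with a column col, would be:

df[df$col %in% df$col[duplicated(df$col)], ]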

François M.
  • This answer seems to use `vec` both as an atomic vector and as a dataframe. I suspect that with an actual dataframe it would fail. – IRTFM Jun 21 '18 at 14:44
2

I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:

df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()

The code groups the rows by the specified columns. If the size of a group is greater than 1, all rows in the group are marked as duplicated. Once that is done, you can use the Duplicated column for filtering etc.
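
A slightly more compact variant of the same idea uses n() for the group size (column names are placeholders, as above):

df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = if_else(n() > 1, "Yes", "No")) %>%
  ungroup()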

Adnan Hajizada
2

This is what vctrs::vec_duplicate_detect() does:

# on a vector
vctrs::vec_duplicate_detect(c(1, 2, 1))
#> [1]  TRUE FALSE  TRUE
# on a data frame
vctrs::vec_duplicate_detect(mtcars[c(1, 2, 1),])
#> [1]  TRUE FALSE  TRUE

Created on 2022-07-19 by the reprex package (v2.0.1)

IceCreamToucan
0

If you are interested in which rows are duplicated for certain columns, you can use a plyr approach:

ddply(df, .(col1, col2), function(df) if(nrow(df) > 1) df else c())

Adding a count variable with dplyr:

df %>% add_count(col1, col2) %>% filter(n > 1)  # data frame
df %>% add_count(col1, col2) %>% pull(n) > 1    # logical vector

For duplicate rows (considering all columns):

df %>% group_by_all() %>% add_tally() %>% ungroup() %>% filter(n > 1)
df %>% group_by_all() %>% add_tally() %>% ungroup() %>% pull(n) > 1

The benefit of these approaches is that you can choose the duplicate count used as a cutoff.
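
For example, assuming the same placeholder columns, to keep only rows whose col1/col2 combination occurs more than twice:

df %>% add_count(col1, col2) %>% filter(n > 2)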

qwr
0

This updates @Holger Brandl's answer to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.

Thus, to get all rows for which there is a duplicate:

iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()

To include the indices of such rows, add a 'rowid' column but exclude it from the grouping:

iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()

Append `%>% pull(rowid)` to the above and you'll get a vector of the indices.

MCornejo