160

R's duplicated returns a vector showing whether each element of a vector, or each row of a data frame, is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are identical, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.

Lauren Samuels (edited by smci)

9 Answers

185

duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it: just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.


Some late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c","c","c") 
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c
Joshua Ulrich (edited by hanna)
  • Hold on, I just ran a test and found I was wrong: `x <- c(1:9, 7:10, 5:22); y <- c(letters, letters[1:5]); test <- data.frame(x, y); test[duplicated(test$x) | duplicated(test$x, fromLast=TRUE), ]` returned all three of the copies of 7, 8, and 9. Why does that work? – JoeM05 Apr 09 '17 at 21:21
  • Because the middle ones are captured no matter whether you start from the end or from the front. For example, `duplicated(c(1,1,1))` vs `duplicated(c(1,1,1), fromLast = TRUE)` gives `c(FALSE,TRUE,TRUE)` and `c(TRUE,TRUE,FALSE)`. The middle value is `TRUE` in both cases, and taking `|` of both vectors gives `c(TRUE,TRUE,TRUE)`. – Brandon Mar 11 '18 at 02:36
42

You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
>  vec %in% unique(vec[ duplicated(vec)]) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
IRTFM
  • Agree. Might even slow down processing but unlikely to slow it down very much. – IRTFM Jun 21 '18 at 14:42
  • Quite true. The OP did not offer a data example to test for "ever duplicated" rows in a dataframe. I think my suggestion of using `duplicated`, `unique` and `%in%` could easily be generalized to a dataframe if one were to first `paste` each row with an unusual separator character, as sketched below. (The accepted answer is better.) – IRTFM Jun 06 '19 at 22:09
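
A minimal sketch of that paste-based generalization, assuming a data frame df whose columns paste cleanly (the separator just needs to be a string that cannot occur in the data):

# collapse each row into a single key string, then reuse the vector approach
key <- do.call(paste, c(df, sep = "\r"))
df[key %in% unique(key[duplicated(key)]), ]
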
23

Duplicated rows in a dataframe can be obtained with dplyr by doing

library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()

To exclude certain columns, group_by_at(vars(-var1, -var2)) could be used instead to group the data.
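
For instance, with the iris-based test data built above, to look for duplicate rows while ignoring the Species column:

df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()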

If the row indices, and not just the data, are actually needed, you could add them first as in:

df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
Holger Brandl (edited by IRTFM)
  • Nice use of `n()`. Don't forget to ungroup the resulting dataframe. – qwr Jul 02 '19 at 21:35
  • @qwr I've adjusted the answer to ungroup the result – Holger Brandl Jul 03 '19 at 07:21
  • @HolgerBrandl, @qwr, the general answer is useful, but I don't understand how to pick column(s) to exclude. What does `vars` refer to in `group_by_at(vars(-var1, -var2))`? Are `var1` and `var2` column names in the data frame? I assume the negative signs signify exclusion, right? So the rest of the process (`filter` and `ungroup`) acts on all the remaining columns, not including `var1` and `var2`, is that right? Sorry to be so pedantic, but I often have problems with quick shorthand! – W Barker Jul 08 '21 at 12:19
  • `vars` is a helper function in dplyr; see https://dplyr.tidyverse.org/reference/vars.html. `var1` and `var2` indeed refer to column names to be excluded from the duplication check. Duplication is assessed on the grouping variables in the suggested solution. Indeed, negative signifies exclusion. – Holger Brandl Jul 08 '21 at 20:23
  • `group_by_all()` and `group_by_at()` have been superseded in recent versions of dplyr. Now you can do this: `iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()` – MCornejo Sep 29 '21 at 20:50
4

Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():

allDuplicated <- function(vec){
  # flag duplicates scanning from the front ...
  front <- duplicated(vec)
  # ... and from the back, so first occurrences are caught too
  back <- duplicated(vec, fromLast = TRUE)
  # an element is "ever duplicated" if either pass flags it
  front | back
}

Using the same example:

vec <- c("a", "b", "c","c","c") 
allDuplicated(vec) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
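
Because duplicated() also accepts data frames, the same helper flags rows; for example, with the df from the accepted answer:

df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
allDuplicated(df)
[1] FALSE FALSE  TRUE  TRUE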

canderson156
3

I've had the same question, and if I'm not mistaken, this is also an answer.

vec[col %in% vec[duplicated(vec$col),]$col]

Dunno which one is faster, though; the dataset I'm currently using isn't big enough for tests that produce significant time gaps.
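
As the comment below notes, this snippet mixes vector and data-frame indexing. A corrected sketch, assuming a data frame df with a column col, would be:

df[df$col %in% df$col[duplicated(df$col)], ]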

François M.
  • This answer seems to use `vec` both as an atomic vector and as a dataframe. I suspect that with an actual dataframe it would fail. – IRTFM Jun 21 '18 at 14:44
2

I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:

df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()

The code groups the rows by the specified columns. If the size of a group is greater than 1, all rows in the group are marked as duplicated. Once that is done, you can use the Duplicated column for filtering etc.
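
A slightly more compact variant of the same idea uses n() for the group size (column names are placeholders, as above):

df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = if_else(n() > 1, "Yes", "No")) %>%
  ungroup()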

Adnan Hajizada
2

This is what vctrs::vec_duplicate_detect() does:

# on a vector
vctrs::vec_duplicate_detect(c(1, 2, 1))
#> [1]  TRUE FALSE  TRUE
# on a data frame
vctrs::vec_duplicate_detect(mtcars[c(1, 2, 1),])
#> [1]  TRUE FALSE  TRUE

Created on 2022-07-19 by the reprex package (v2.0.1)

IceCreamToucan
0

If you are interested in which rows are duplicated for certain columns, you can use a plyr approach:

ddply(df, .(col1, col2), function(df) if(nrow(df) > 1) df else c())

Adding a count variable with dplyr:

df %>% add_count(col1, col2) %>% filter(n > 1)  # data frame
df %>% add_count(col1, col2) %>% pull(n) > 1    # logical vector

For duplicate rows (considering all columns):

df %>% group_by_all() %>% add_tally() %>% ungroup() %>% filter(n > 1)
df %>% group_by_all() %>% add_tally() %>% ungroup() %>% pull(n) > 1

The benefit of these approaches is that you can choose the duplicate count used as a cutoff.
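
For example, assuming the same placeholder columns, to keep only rows whose col1/col2 combination occurs more than twice:

df %>% add_count(col1, col2) %>% filter(n > 2)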

qwr
0

This updates @Holger Brandl's answer to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.

Thus, to get all rows for which there is a duplicate:

iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()

To include the indices of such rows, add a 'rowid' column but exclude it from the grouping:

iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()

Append `%>% pull(rowid)` to the above and you'll get a vector of the indices.

MCornejo