
I want to find rows in a dataset where the values in all columns except one match. After much unsuccessful messing around trying to get duplicated() to return all instances of the duplicated rows (not just the first occurrence of each), I figured out a way to do it (below).

For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.

library(tidyverse)

# Drop the column to ignore, find which remaining rows are duplicated,
# then use a semi join to pull every matching row from the full dataset
x <- iris %>% select(-Petal.Width)
dups <- x[duplicated(x), ]
answer <- iris %>% semi_join(dups)

> answer 
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           5.1         3.5          1.4         0.2    setosa
2           4.9         3.1          1.5         0.1    setosa
3           4.8         3.0          1.4         0.1    setosa
4           5.1         3.5          1.4         0.3    setosa
5           4.9         3.1          1.5         0.2    setosa
6           4.8         3.0          1.4         0.3    setosa
7           5.8         2.7          5.1         1.9 virginica
8           6.7         3.3          5.7         2.1 virginica
9           6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica

As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm ignorant of a single function that does it in fewer steps or a generally tidier way. Any suggestions?

An alternative approach, taken from at least two other posts and applied to this case, would be:

answer <- iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE), ]

But that also seems like just another workaround rather than a single function. Both approaches take the same amount of time (0.08 sec on my system). Is there no neater/faster way of doing this?

e.g. something like `iris %>% duplicates(all = TRUE, ignore = Petal.Width)`
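In the meantime I can get close to that with a small wrapper of my own (`duplicates()` below is a hypothetical helper I'm defining myself, not an existing function, and it takes the ignored column as a string rather than a bare name):

library(dplyr)

# Hypothetical helper: return all rows that are duplicated once the
# named column is ignored
duplicates <- function(df, ignore) {
  idx <- which(names(df) == ignore)
  df[duplicated(df[-idx]) | duplicated(df[-idx], fromLast = TRUE), ]
}

iris %>% duplicates(ignore = "Petal.Width")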

Paul Raftery
  • Please do a small example with reproducible data and desired outcome. Thank you. – Andre Elrico Jul 12 '18 at 10:03
  • This is basically `iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]` – David Arenburg Jul 12 '18 at 10:22
  • Yes, I saw that post before, which is quite similar to David's answer, but it also seems like a less than elegant solution to what I presume must be a relatively common task... is there no single function that returns all duplicates except one column? Or even just returns all duplicate entries? – Paul Raftery Jul 12 '18 at 10:39
  • Regarding "_Yes, I saw that post before,_" you should include such information in your post and clearly explain why previous answers don't meet your needs. Cheers – Henrik Jul 12 '18 at 10:53
  • I have added a dplyr solution which needs a few fewer characters to write and might be a little easier to read than the base R solution. However, I did not microbenchmark it to test speed. – Lennyy Jul 12 '18 at 11:43

3 Answers

iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]

Ignoring column 4, duplicated(iris[,-4]) flags the second row of each duplicate set (rows 18, 35, 46, 133, 143 and 145), while duplicated(iris[,-4], fromLast = TRUE) flags the first row of each set (rows 1, 10, 13, 102, 125 and 129). Combining the two with | yields 12 TRUEs, so the expected output is returned.
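As a quick check, the indices flagged by each call can be inspected directly; they match the row numbers listed above:

which(duplicated(iris[,-4]))
#> [1]  18  35  46 133 143 145
which(duplicated(iris[,-4], fromLast = TRUE))
#> [1]   1  10  13 102 125 129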

Or perhaps with dplyr: group on all variables except Petal.Width, count how often each combination occurs, and keep those that occur more than once.

library(dplyr)
iris %>% 
  group_by_at(vars(-Petal.Width)) %>% 
  filter(n() > 1)

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
 1          5.1         3.5          1.4         0.2    setosa
 2          4.9         3.1          1.5         0.1    setosa
 3          4.8         3.0          1.4         0.1    setosa
 4          5.1         3.5          1.4         0.3    setosa
 5          4.9         3.1          1.5         0.2    setosa
 6          4.8         3.0          1.4         0.3    setosa
 7          5.8         2.7          5.1         1.9 virginica
 8          6.7         3.3          5.7         2.1 virginica
 9          6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica
Lennyy
  • Please provide some explanation rather than code only. Thanks! – petezurich Jul 12 '18 at 11:08
  • I added some explanation for my base R solution, and also added an approach with dplyr now. – Lennyy Jul 12 '18 at 11:35
  • On my dataset (~600K rows, 5 cols), the first option using just duplicated() is the fastest at 0.14s. However, the second using dplyr - the 'tidy' solution - reads as much more intuitive to me. It is slower (at 0.52s) than the first option though. Unsurprisingly the initial solution I proposed is the slowest (at 8.6s). Thanks! – Paul Raftery Jul 12 '18 at 12:48

I think janitor can do this somewhat directly.

library(janitor)

get_dupes(iris, !Petal.Width)

# get_dupes(iris, !Petal.Width)[,names(iris)] # alternative: no count column
   Sepal.Length Sepal.Width Petal.Length   Species dupe_count Petal.Width
1           4.8         3.0          1.4    setosa          2         0.1
2           4.8         3.0          1.4    setosa          2         0.3
3           4.9         3.1          1.5    setosa          2         0.1
4           4.9         3.1          1.5    setosa          2         0.2
5           5.1         3.5          1.4    setosa          2         0.2
6           5.1         3.5          1.4    setosa          2         0.3
7           5.8         2.7          5.1 virginica          2         1.9
8           5.8         2.7          5.1 virginica          2         1.9
9           6.4         2.8          5.6 virginica          2         2.1
10          6.4         2.8          5.6 virginica          2         2.2
11          6.7         3.3          5.7 virginica          2         2.1
12          6.7         3.3          5.7 virginica          2         2.5
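If the dupe_count column and the reordered columns are unwanted, one option (assuming dplyr is also loaded, for select()) is to select the original columns back out, much like the base R alternative shown in the comment above:

library(dplyr)

# Drop janitor's dupe_count column and restore iris's original column order
get_dupes(iris, !Petal.Width) %>%
  select(all_of(names(iris)))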

steve_b

I looked into the source of duplicated but would be interested to see if anyone can find anything faster; it might involve going to Rcpp or something similar. On my machine the base method is the fastest, though your original method actually beats the most readable dplyr method. I think wrapping a function like this for your own purposes ought to be sufficient, since your run times don't seem excessively long anyway. You can simply do `iris %>% opt2("Petal.Width")` for pipeability if that's the main concern.

library(tidyverse)
library(microbenchmark)

# Original approach: drop the ignored column, find duplicates, semi join back
opt1 <- function(df, ignore) {
  ignore <- enquo(ignore)
  x <- df %>% select(-!!ignore)
  dups <- x[duplicated(x), ]
  df %>% semi_join(dups)  # was hardcoded to iris; use df so the function generalises
}

# Base R approach: locate the ignored column by name and index it out
opt2 <- function(df, ignore) {
  index <- which(colnames(df) == ignore)
  df[duplicated(df[-index]) | duplicated(df[-index], fromLast = TRUE), ]
}

# dplyr approach: group on everything except the ignored column
opt3 <- function(df, ignore) {
  ignore <- enquo(ignore)
  df %>%
    group_by_at(vars(-!!ignore)) %>%
    filter(n() > 1)
}


microbenchmark(
  opt1 = suppressMessages(opt1(iris, Petal.Width)),
  opt2 = opt2(iris, "Petal.Width"),
  opt3 = opt3(iris, Petal.Width)
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval cld
#>  opt1 3.427753 4.024185 4.851445 4.464072 5.069216 12.800890   100  b 
#>  opt2 1.712975 1.908130 2.403859 2.133632 2.542871  7.557102   100 a  
#>  opt3 6.604614 7.334304 8.461424 7.920369 8.919128 24.255678   100   c

Created on 2018-07-12 by the reprex package (v0.2.0).

Calum You