I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (not just the first instance), I figured out a way to do it (below).
For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.
require(tidyverse)
x = iris%>%select(-Petal.Width)
dups = x[x%>%duplicated(),]
answer = iris%>%semi_join(dups)
> answer
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
As you can see, that works, but this is one of those times when I'm almost certain that lots other folks need this functionality, and that I'm ignorant of a single function that does this in fewer steps or a generally tidier way. Any suggestions?
An alternate approach, from at least two other posts, applied to this case would be:
answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]
But that also seems like just a different workaround instead of single function. Both approaches take the same amount of time. (0.08 sec on my system). Is there no neater/faster way of doing this?
e.g. something like iris%>%duplicates(all=TRUE,ignore=Petal.Width)