Use R to find duplicates in multiple columns at once

Question

I have a table with three columns: Surname, First Name and Address. I'm looking to match up families by searching for people with the same surname AND the same address. I figured out how to use Duplicated to filter down to only people with the same name OR the same address. Here's my sample table:

Surname First Name  Address
A1        Bobby       X1
B5         Joe        X2
B5        Mary        X3
F2        Lou         X4
F3        Sarah       X5
G4        Bobby       X6
G4        Fred        X6
G4        Anna        X6
H5        Eric        X7
K6        Peter       X8

And the code I used to filter it is:

duplicates = duplicated(sample$Surname)
sample_surnames= sample %>% filter(duplicates)

Here's the output of that code:

Surname First Name  Address         
B5        Mary        X3
G4        Fred        X6
G4        Anna        X6

The problem is two fold:

This code drops the first instance of any duplicate. i.e. Bobby, Fred and Anna should all be included but Bobby is dropped.
Is there a way to filter for duplicates in both the Surname and Address columns at once or do I need to perform the operation twice? To be clear: I'm looking for instances where there's a duplicate in BOTH columns.

Update: Here is the table I'd like to get in the end: Please note, I'm not trying to remove the duplicates but, rather, keep the duplicates. In this case, Bobby, Fred and Anna are the only ones who have both the same Surname and Address.

Surname First Name  Address
G4       Bobby       X6
G4        Fred       X6
G4       Anna        X6

The link to the duplicate question doesn't answer my question as far as I can tell. I've updated the question to clarify. — sreed, Apr 25 '17 at 16:53

akrun · Accepted Answer · 2017-04-25T17:00:14.063

5

We can use unique with by option from data.table

library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
#    Surname First Name Address
#1:      A1      Bobby      X1
#2:      B5        Joe      X2
#3:      B5       Mary      X3
#4:      F2        Lou      X4
#5:      F3      Sarah      X5
#6:      G4      Bobby      X6
#7:      H5       Eric      X7
#8:      K6      Peter      X8

Or with tidyverse

library(dplyr)
df %>% 
  distinct(Surname, Address, .keep_all = TRUE)
# Surname First Name Address
#1      A1      Bobby      X1
#2      B5        Joe      X2
#3      B5       Mary      X3
#4      F2        Lou      X4
#5      F3      Sarah      X5
#6      G4      Bobby      X6
#7      H5       Eric      X7
#8      K6      Peter      X8

Update

Based on the updated post, perhaps this helps

setDT(df)[, if((uniqueN(FirstName))>1) .SD,.(Surname, Address)]
#   Surname Address FirstName
#1:      G4      X6     Bobby
#2:      G4      X6      Fred
#3:      G4      X6      Anna

edited Apr 25 '17 at 17:00

answered Apr 22 '17 at 03:26

akrun

874,273
37
540
662

I'm actually looking for more or less the opposite of this. I want to save only the instances that have both the same surname and the same address. I've updated the question to make it more clear. Thanks. – sreed Apr 25 '17 at 16:52
@sreed I updated the post. Can you check if this works? – akrun Apr 25 '17 at 17:00
2

That worked! Your awesome. Thank you! Now, to figure out what the code actually means.... – sreed Apr 25 '17 at 18:38

Use R to find duplicates in multiple columns at once

1 Answers1

Update