
I'm working with a big dataset of names and need to be able to group by the individual. The dataset may contain names that look different but refer to the same person, such as "John Doe" and "John A. Doe", or "Michael Smith" and "Mike Smith". Is there a way for R to find instances like these and recognize them as the same person?

df <- data.frame(
  name = c("John Doe", "John A. Doe", "Jane Smith", "Jane Anderson", "Jane Anderson Lowell",
           "Jane B. Smith", "John Doe", "Jane Smith", "Michael Smith",
           "Mike Smith", "A.K. Ross", "Ana Kristina Ross"),
  rating = c(1, 2, 1, 1, 2, 3, 1, 4, 2, 1, 3, 2)
)

Here, several individuals are repeated, whether the variation is a middle initial, a shortened name, a lengthened name, or a changed last name. I've been trying to find a function that could give a similarity percentage for pairs of names, so that I could manually examine the high-percentage cases and decide whether they really are the same person. My end goal is to find the average rating by person, which requires grouping by the individual.

Mary
  • You're getting into the realm of probabilistic data linkage/matching to do this thoroughly. There are plenty of string distance packages in R, like https://cran.r-project.org/web/packages/phonics/vignettes/phonics.html, and full-blown packages like RecordLinkage - https://cran.r-project.org/web/packages/RecordLinkage/index.html, as well as a limited set of built-in functionality, as I show here: https://stackoverflow.com/q/27975705/496803 – thelatemail Jun 16 '21 at 02:55
  • Possible duplicate of https://stackoverflow.com/questions/6683380/techniques-for-finding-near-duplicate-records?noredirect=1&lq=1 – thelatemail Jun 16 '21 at 02:57

1 Answer


There are many algorithms that measure string distance. Here is a simple approach for this example dataset using the stringdist package. As suggested by the documentation of the stringdist() function, the Jaro-Winkler distance is used to measure the distance between each pair of names. Note that I only pair names that share the same first two letters (a simple blocking step, so that not every possible pair has to be compared). By eyeballing the results, a string distance of 0.15 seems to be a reasonable threshold to define a match.

library(tidyverse)
library(stringdist)

get_string_distance <- function(x) {
  x <- unique(x)
  if (length(x) == 1) {
    # Only one distinct name in this group: nothing to compare
    data.frame(name1 = x, name2 = x, string_distance = NA_real_)
  } else {
    # All pairwise combinations of distinct names, with their Jaro-Winkler distances
    x %>% 
      combn(2) %>% 
      t() %>% 
      as.data.frame() %>% 
      setNames(c("name1", "name2")) %>% 
      mutate(string_distance = stringdist(name1, name2, method = "jw"))
  }
}
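
A quick way to sanity-check the helper on its own (an illustrative call using a few of the names from df; output omitted):

get_string_distance(c("John Doe", "John A. Doe", "Jane Smith"))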

dat <- df %>% 
  mutate(two_letters = str_sub(name, 1, 2)) %>%  # block on the first two letters of each name
  nest_by(two_letters) %>%                       # one nested data frame per block
  mutate(same_name = list(get_string_distance(data$name))) %>%  # pairwise distances within each block
  ungroup()

dat1 <- dat %>% 
  unnest(same_name) %>%                  # flatten the pairwise distance tables
  filter(string_distance < 0.15) %>%     # keep only pairs below the chosen threshold
  select(name1, name2, string_distance)

dat1

# # A tibble: 4 x 3
#   name1         name2                string_distance
#   <chr>         <chr>                          <dbl>
# 1 Jane Smith    Jane B. Smith                 0.0769
# 2 Jane Anderson Jane Anderson Lowell          0.117 
# 3 John Doe      John A. Doe                   0.0909
# 4 Michael Smith Mike Smith                    0.136 
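
To get from these matched pairs to the average rating per person (the end goal in the question), one option is to treat name1 as the canonical spelling for each confirmed pair and recode the variants before aggregating. A minimal sketch, assuming the pairs in dat1 have already been manually reviewed and that no name chains through more than one pair:

# Map each variant spelling (name2) onto its canonical form (name1)
canonical <- dat1 %>% 
  select(variant = name2, canonical = name1)

df %>% 
  left_join(canonical, by = c("name" = "variant")) %>% 
  mutate(person = coalesce(canonical, name)) %>%  # unmatched names stand for themselves
  group_by(person) %>% 
  summarise(avg_rating = mean(rating), n = n())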
Zaw
  • built-in `agrep` doesn't return distances, but it does check for approximate equality (using Levenshtein distance) – Ben Bolker Jun 16 '21 at 03:39
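
Expanding on that comment, a minimal base-R sketch of the approximate-matching route (no extra packages; the max.distance value is an illustrative choice, not a tuned threshold):

# Approximate matches for one name, allowing a fraction of the pattern to differ
agrep("Mike Smith", df$name, max.distance = 0.25, value = TRUE)

# Full matrix of Levenshtein edit distances between all names in df
adist(df$name)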