For tasks like this, I like to use a divide-and-conquer strategy, as you quickly run into memory issues when comparing a large number of strings or longer strings.
packages
library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(stringdist)
phase 1: token similarity
I add an ID column and combine name and address into fulltext for comparison.
my_data2 <- my_data |>
  mutate(ID = factor(row_number()),
         fulltext = paste(name, address))
The quanteda approach to similarity is to divide strings into words/tokens before comparing which tokens two strings have in common. This is extremely efficient compared to string distance:
duplicates <- my_data2 |>
  # a bunch of wrangling to create the quanteda dfm object
  corpus(docid_field = "ID",
         text_field = "fulltext") |>
  tokens() |>
  dfm() |>
  # calculate similarity using cosine (other methods are available)
  textstat_simil(method = "cosine") |>
  as_tibble() |>
  # attach the original documents back to the output
  left_join(my_data2, by = c("document1" = "ID")) |>
  left_join(my_data2, by = c("document2" = "ID"), suffix = c("", "_comparison"))
duplicates |>
  select(cosine,
         address, address_comparison,
         name, name_comparison)
#> # A tibble: 5 × 5
#> cosine address address_comparison name name_…¹
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 0.641 882 4N Road River NY, NY 12345 882 - River Road NY, Z… ABC … Cent. …
#> 2 0.0801 882 4N Road River NY, NY 12345 123 Fake Road Boston D… ABC … BD Hom…
#> 3 0.0833 882 - River Road NY, ZIP 12345 123 Fake Road Boston D… Cent… BD Hom…
#> 4 0.0962 882 - River Road NY, ZIP 12345 123 Fake - Rd Boston 5… Cent… Boardi…
#> 5 0.481 123 Fake Road Boston Drive Boston 123 Fake - Rd Boston 5… BD H… Boardi…
#> # … with abbreviated variable name ¹name_comparison
As you can see, the first and second, as well as the third and fourth entries show a rather high similarity, 0.641 and 0.481 respectively. This comparison can already be enough to identify duplicates in most cases. However, it completely ignores word order. The classic example is that "Dog bites man" and "Man bites dog" have a token similarity of 100%, yet entirely different meanings. Look into your dataset to figure out whether the order of tokens plays a role (the snippet below demonstrates the problem). If you think it does, read on.
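A quick sketch of that classic example, using the same quanteda pipeline as above (the two sentences are, of course, made up):

# both sentences tokenise to {dog, bites, man},
# so their cosine similarity is 1 despite the opposite meaning
c(d1 = "Dog bites man", d2 = "Man bites dog") |>
  tokens() |>
  dfm() |>   # dfm() lowercases by default, so "Dog" and "dog" match
  textstat_simil(method = "cosine")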
phase 2: string similarity
String similarity as implemented in stringdist is a normalised version of string distance. The issue with raw distance is that the length of the texts you compare plays no role: two 4-letter strings differing in two letters are very dissimilar, while the same two-letter difference is negligible for two 100-letter strings. Your example looks like this might not be a big issue, but in general I prefer similarity for that reason.
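To illustrate (a minimal sketch with made-up strings): the Levenshtein distance is 2 in both cases below, but the similarity reflects how much of each string actually matches.

long1 <- strrep("abcd", 25)                 # 100 characters
long2 <- paste0(strrep("abcd", 24), "abxy") # same, except the last two letters
stringdist("abcd", "abxy", method = "lv")   # distance 2
stringsim("abcd", "abxy", method = "lv")    # 0.5 -- half the string differs
stringdist(long1, long2, method = "lv")     # distance 2 again
stringsim(long1, long2, method = "lv")      # 0.98 -- almost identical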
The problem with string similarity and distance, however, is that they are computationally very costly. Even a couple of hundred short texts can quickly take up your entire memory. So what you can do is filter the results above and only calculate string similarity on the candidates which already look like duplicates:
duplicates_stringsim <- duplicates |>
  filter(cosine > 0.4) |>
  mutate(stringsim = stringsim(fulltext, fulltext_comparison, method = "lv"))
duplicates_stringsim |>
  select(cosine, stringsim,
         address, address_comparison,
         name, name_comparison)
#> # A tibble: 2 × 6
#> cosine stringsim address address_com…¹ name name_…²
#> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 0.641 0.48 882 4N Road River NY, NY 12345 882 - River … ABC … Cent. …
#> 2 0.481 0.354 123 Fake Road Boston Drive Boston 123 Fake - R… BD H… Boardi…
#> # … with abbreviated variable names ¹address_comparison, ²name_comparison
For comparison, the stringsim values for the three comparisons we have already eliminated are 0.2, 0.208 and 0.133. Even though the values are a little smaller, the string similarities confirm the results from phase 1.
Now the final step is to remove the duplicates from the original data.frame. For this I apply another filter, pull the IDs out of the duplicates_stringsim object, and then remove those rows from the data:
dup_ids <- duplicates_stringsim |>
  filter(stringsim > 0.3) |>
  pull(document2)

my_data2 |>
  filter(!ID %in% dup_ids)
#> address name ID
#> 1 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 2 123 Fake Road Boston Drive Boston BD Home 25 New 3
#> fulltext
#> 1 ABC Center Building 882 4N Road River NY, NY 12345
#> 2 BD Home 25 New 123 Fake Road Boston Drive Boston
Created on 2022-11-16 with reprex v2.0.2
Note that I chose the cutoff values based on your requirements for the example. You will have to fine-tune them for your dataset, and likely for every new project.
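If you are unsure where to put the cutoffs, plotting the distribution of scores can help, since a natural gap between low and high similarities is usually a good place to cut (a quick sketch, assuming the duplicates object from above):

# eyeball the similarity distribution to find a natural gap for the cutoff
duplicates |>
  ggplot(aes(x = cosine)) +
  geom_histogram(bins = 30)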