I have this dataset in R that looks something like this:
address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")
cluster = c("A", "A", "B", "B")
my_data = data.frame(address, name, cluster)
                            address                name cluster
1    882 4N Road River NY, NY 12345 ABC Center Building       A
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC       A
3 123 Fake Road Boston Drive Boston      BD Home 25 New       B
4        123 Fake - Rd Boston 56789  Boarding Direct 25       B
My goal is to learn how to remove "fuzzy duplicates" from this dataset. In the example above, it is clear to a human that there are only 2 unique records, but a computer would have difficulty reaching that conclusion, so a fuzzy-matching technique is needed.
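For instance, a plain string distance already separates the two cases. This is just to illustrate what I mean by "fuzzy" (the default stringdist metric, optimal string alignment, is only a placeholder):

library(stringdist)

# The distance between two spellings of the same address (both cluster A)
# should come out much smaller than the distance between records that
# are genuinely different:
stringdist(address[1], address[2])  # same place, different formatting
stringdist(address[1], address[3])  # different places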
In a previous question (Removing Fuzzy Duplicates in R), I learned about different ways to remove "fuzzy" duplicates from this dataset. When I tried these methods on my real data (100,000 rows), I got the following errors:
library(dplyr)
library(tidyr)
library(stringdist)

# METHOD 1: cross join every row with every other row, then score each pair
my_data_dists <- my_data %>%
  mutate(row = row_number()) %>%
  full_join(., ., by = character()) %>%  # cross join: all row pairs
  filter(row.x < row.y) %>%              # keep each unordered pair once
  mutate(
    address.dist = stringdist(address.x, address.y),
    name.dist = stringdist(name.x, name.y)
  ) %>%
  arrange(scale(address.dist) + scale(name.dist)) %>%  # most similar first
  relocate(
    row.x, row.y,
    address.dist, name.dist,
    address.x, address.y,
    name.x, name.y
  )
Error: cannot allocate vector of size 237.6 Gb
# METHOD 2: full pairwise distance matrix on the name column
name_dists <- adist(my_data$name)
Error: cannot allocate vector of size 475.3 Gb
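For scale, both approaches try to materialize on the order of n² cells at once. A rough back-of-the-envelope, with n set to my row count:

n <- 1e5  # rows in my real data

# Row pairs produced by the METHOD 1 self-join before filtering:
n * n                # 1e10 combinations

# Bytes for a single dense n x n double matrix (8 bytes per cell),
# before counting the extra columns each method carries along:
n * n * 8 / 1024^3   # roughly 74.5 GiB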
It seems like both of these methods require too much memory to run. Ultimately, I am interested in testing the following:
- Test 1: Removing fuzzy duplicates based on name and address
- Test 2: Removing fuzzy duplicates based only on the address
Does anyone know of any ways I might be able to solve this problem?
Thank you!
Note: I understand that this procedure involves a quadratic number of pairwise comparisons. In my example I have included a "cluster" variable so that the deduplication can be performed within each cluster rather than on the whole dataset, which brings the number of comparisons down (e.g. 2C2 = 1 pair per cluster instead of 4C2 = 6 pairs overall for the 4 rows above).
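For instance, here is a rough sketch of the within-cluster idea I have in mind (the "jw" metric and the 0.25 threshold are placeholders I would still need to tune; for Test 2, the name condition would simply be dropped):

library(dplyr)
library(stringdist)

# Compare rows only within a cluster: an n-row cluster costs
# choose(n, 2) comparisons instead of choose(100000, 2) overall.
dedupe_cluster <- function(df, threshold = 0.25) {
  keep <- rep(TRUE, nrow(df))
  for (i in seq_len(nrow(df) - 1)) {
    if (!keep[i]) next
    for (j in (i + 1):nrow(df)) {
      if (!keep[j]) next
      # Jaro distances in [0, 1] for both fields of the candidate pair
      d_addr <- stringdist(df$address[i], df$address[j], method = "jw")
      d_name <- stringdist(df$name[i], df$name[j], method = "jw")
      # flag row j as a fuzzy duplicate of row i if both fields are close
      if (d_addr < threshold && d_name < threshold) keep[j] <- FALSE
    }
  }
  df[keep, , drop = FALSE]
}

deduped <- my_data %>%
  group_by(cluster) %>%
  group_modify(~ dedupe_cluster(.x)) %>%
  ungroup()

This keeps the first row of each fuzzy group and drops the rest, but the largest cluster still dominates the cost, which is why I am asking for approaches that scale better.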