
I have this dataset in R that looks something like this:

address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
            
 name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")

cluster = c("A", "A", "B", "B")
            
my_data = data.frame(address, name, cluster)

                            address                name cluster
1    882 4N Road River NY, NY 12345 ABC Center Building       A
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC       A
3 123 Fake Road Boston Drive Boston      BD Home 25 New       B
4        123 Fake - Rd Boston 56789  Boarding Direct 25       B
   

My goal is to learn how to remove "fuzzy duplicates" from this dataset. For example, in the dataset above it is clear to a human that there are only 2 unique records, but a computer would have difficulty coming to this conclusion. Therefore, a "fuzzy" matching technique has to be used to tackle this problem.
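
To make the "fuzzy" part concrete, here is a small check with stringdist (a sketch only - the exact values depend on the default "osa" method; the point is just that within-record distances come out smaller than cross-record ones):

library(stringdist)

# same underlying record (rows 1 and 2) -> comparatively small distance
stringdist("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345")

# different records (rows 1 and 3) -> larger distance
stringdist("882 4N Road River NY, NY 12345", "123 Fake Road Boston Drive Boston")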

In a previous question (Removing Fuzzy Duplicates in R), I learned about different ways to remove "fuzzy" duplicates from this dataset. When I tried these methods on my real data (100,000 rows), I got the following errors:

library(dplyr)
library(tidyr)
library(stringdist)

# METHOD 1

my_data_dists <- my_data %>% 
    mutate(row = row_number()) %>% 
    full_join(., ., by = character()) %>% 
    filter(row.x < row.y) %>% 
    mutate(
        address.dist = stringdist(address.x, address.y),
        name.dist = stringdist(name.x, name.y)
    ) %>% 
    arrange(scale(address.dist) + scale(name.dist)) %>% 
    relocate(
        row.x, row.y,
        address.dist, name.dist,
        address.x, address.y, 
        name.x, name.y
    )

Error: cannot allocate vector of size 237.6 Gb

# METHOD 2

> name_dists <- adist(my_data$name)
Error: cannot allocate vector of size 475.3 Gb

Both of these methods materialise every pairwise comparison at once (the full self-join and the adist() matrix are roughly 100,000 × 100,000 in size), so they require far too much memory to run. Ultimately, I am interested in testing the following:

  • Test 1: Removing fuzzy duplicates based on name and address
  • Test 2: Removing fuzzy duplicates based only on the address

Does anyone know of any ways I might be able to solve this problem?

Thank you!

Note: I understand that this procedure involves a quadratic number of pairwise comparisons. In my example, I have included a "cluster" variable, and the deduplication can be performed within each cluster rather than on the whole dataset, so far fewer comparisons are needed (e.g. 4C2 = 6 comparisons overall vs. 2C2 = 1 comparison within each of the two clusters).
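
For reference, here is a rough sketch of what I mean, applied to Method 1: the pairwise comparison is run separately inside each cluster, so memory grows with the size of the largest cluster instead of the whole dataset (the object name pairs_within_cluster is just a placeholder, and I use a plain sum of the distances instead of the scaled sum above):

library(dplyr)
library(purrr)
library(stringdist)

pairs_within_cluster <- my_data %>%
    mutate(row = row_number()) %>%
    group_split(cluster) %>%
    map_dfr(function(block) {
        block %>%
            # cross join within the block (dplyr may warn about a many-to-many join here)
            full_join(block, by = "cluster") %>%
            filter(row.x < row.y) %>%
            mutate(
                address.dist = stringdist(address.x, address.y),
                name.dist = stringdist(name.x, name.y)
            )
    }) %>%
    arrange(address.dist + name.dist)

On 100,000 rows this only helps if the clusters are reasonably small, since the work inside each cluster is still quadratic in its size.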

stats_noob
  • Could you try to geocode the addresses and then filter on common lat/long? – AndS. Nov 13 '22 at 14:13
  • @AndS. Thank you for your reply! I had considered this approach, but the addresses are in a messy format which would take a lot of time to standardize prior to geocoding. I thought that record linkage might be a better approach? Thanks! – stats_noob Nov 13 '22 at 18:03
  • Unless you control the data inflow, you might not be able to solve this: two completely different addresses might be closer in terms of string distance than two versions of the same address. E.g. `123 Fake Road Boston Drive Boston` is closer to `123 Lake Road Huston Drive Huston` than it is to `123 Fake - Rd Boston 56789`, yet the latter is the one you need to match. So there is no direct automation; try using `geocode` or even control the data collection. – Onyambu Nov 24 '22 at 05:03
  • I second AndS.'s and Onyambu's comments. Usually geocoding services are much more likely to resolve these ambiguities than an approach based on fuzzy matching. – nicola Nov 28 '22 at 10:49
  • I predicted this problem in my answer to your original question. Is there a specific reason why the answer falls short for your purposes? Why do you want to use distance instead of token similarity (which is much lighter on memory)? https://stackoverflow.com/a/74458813/5028841 – JBGruber Nov 30 '22 at 10:05

1 Answer


I encountered a similar problem when fuzzy matching records between two databases, and the stringdist_left_join function from fuzzyjoin helped me a lot, especially with the qgram metric!

I hope the example below can help you; if not, try adjusting the method and parameters noted in the comments. Here is the code:

library(dplyr)
library(tidyr)
library(stringdist)
library(fuzzyjoin)


address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345","address 2 1 fortest",
            "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789","address 2.1 for test","addres 2 1 for testt")

name = c("ABC Center Building", "Cent. Bldg ABC", "name 2 1 for test",
         "BD Home 25 New", "Boarding Direct 25","name 2.1 fortest","name 2.1 forr ttest")

cluster = c("A", "A","C",
            "B", "B","C","C")

my_data = data.frame(address, name, cluster)  
# > my_data
# address                name cluster
# 1    882 4N Road River NY, NY 12345 ABC Center Building       A
# 2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC       A
# 3               address 2 1 fortest   name 2 1 for test       C
# 4 123 Fake Road Boston Drive Boston      BD Home 25 New       B
# 5        123 Fake - Rd Boston 56789  Boarding Direct 25       B
# 6              address 2.1 for test    name 2.1 fortest       C
# 7              addres 2 1 for testt name 2.1 forr ttest       C  

my_data_1 = my_data %>%
   # ---add a column all_text that pastes address and name together, so the fuzzy match has more text to work with
   mutate(all_text = paste(address, name))
   

# Note: nrow(my_data_1) is evaluated once here, before any rows are removed inside the loop.
# Iterations where i points past the end of the shrinking data frame produce an all-NA df2
# that matches nothing, so they are wasted but harmless.
for (i in 1:nrow(my_data_1)) {
   df2 = my_data_1[i, ]
  
   mydata_2 = my_data_1[-c(i), ]
     
   my_data_dists = stringdist_left_join(df2, mydata_2, 
                          by = "all_text",
                          # ---you can change the method; details of the available methods are here: https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.6/topics/stringdist-metrics
                          method = "qgram", 
                          # ---you can adjust this parameter
                          q = 1:4,
                          max_dist = 30, 
                          distance_col = "dist") %>%
      # *****here you can filter further if you only want to remove the single most similar duplicate
      # group_by(all_text.x) %>%
      # top_n(1, -dist) %>%
      # ungroup() %>%
      
      #  --------
      # *****here you can filter further on another condition, e.g. a minimum number of characters; it helps a lot if there is a big variation in your texts!

      # filter(dist < 5 | (dist >= 5 &
      #                   nchar(all_text.x) > 18 &
      #                   nchar(all_text.y) > 18)) %>%
      #  --------
   
      select(c("address.y", "name.y", "cluster.y", "all_text.y"))
   
   # rename the columns of my_data_dists so they line up with my_data_1 (needed for the anti_join below)
   names(my_data_dists) = names(my_data_1)
      
   
   if (nrow(my_data_dists) >= 1)
   {
      #---------delete all detected duplicates and keep just the current row
      my_data_1 = anti_join(my_data_1, my_data_dists)
   }
  
}
result = my_data_1 %>% select(-all_text)
result
# > result
# address                name cluster
# 1    882 4N Road River NY, NY 12345 ABC Center Building       A
# 2               address 2 1 fortest   name 2 1 for test       C
# 3 123 Fake Road Boston Drive Boston      BD Home 25 New       B

Please let me know if more details are required.

islem
  • Note that the clusters are not taken into consideration in the comparison; they were only included for the example. – islem Nov 29 '22 at 18:32