
I have this dataset in R that looks something like this:

address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
            
 name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")

cluster = c("A", "A", "B", "B")
            
my_data = data.frame(address, name, cluster)

                            address                name cluster
1    882 4N Road River NY, NY 12345 ABC Center Building       A
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC       A
3 123 Fake Road Boston Drive Boston      BD Home 25 New       B
4        123 Fake - Rd Boston 56789  Boarding Direct 25       B
   

My goal is to learn how to remove "fuzzy duplicates" from this dataset. For example, in the dataset above it is clear to a human that there are only 2 unique records, but a computer would have difficulty coming to this conclusion. Therefore, a "fuzzy" matching technique has to be used to tackle this problem.
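
To make the "fuzzy" part concrete, here is a small check with stringdist (a sketch only - the exact values depend on the default "osa" method; the point is just that within-record distances come out smaller than cross-record ones):

library(stringdist)

# same underlying record (rows 1 and 2) -> comparatively small distance
stringdist("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345")

# different records (rows 1 and 3) -> larger distance
stringdist("882 4N Road River NY, NY 12345", "123 Fake Road Boston Drive Boston")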

In a previous question (Removing Fuzzy Duplicates in R), I learned about different ways to remove "fuzzy" duplicates from this dataset. When I tried these methods on my real data (100,000 rows), I got the following errors:

library(dplyr)
library(tidyr)
library(stringdist)

# METHOD 1

my_data_dists <- my_data %>% 
    mutate(row = row_number()) %>% 
    full_join(., ., by = character()) %>% 
    filter(row.x < row.y) %>% 
    mutate(
        address.dist = stringdist(address.x, address.y),
        name.dist = stringdist(name.x, name.y)
    ) %>% 
    arrange(scale(address.dist) + scale(name.dist)) %>% 
    relocate(
        row.x, row.y,
        address.dist, name.dist,
        address.x, address.y, 
        name.x, name.y
    )

Error: cannot allocate vector of size 237.6 Gb

# METHOD 2

> name_dists <- adist(my_data$name)
Error: cannot allocate vector of size 475.3 Gb

Both of these methods materialise every pairwise comparison at once (the full self-join and the adist() matrix are roughly 100,000 × 100,000 in size), so they require far too much memory to run. Ultimately, I am interested in testing the following:

  • Test 1: Removing fuzzy duplicates based on name and address
  • Test 2: Removing fuzzy duplicates based only on the address

Does anyone know of any ways I might be able to solve this problem?

Thank you!

Note: I understand that this procedure involves a quadratic number of pairwise comparisons. In my example, I have included a "cluster" variable, and the deduplication can be performed within each cluster rather than on the whole dataset, so far fewer comparisons are needed (e.g. 4C2 = 6 comparisons overall vs. 2C2 = 1 comparison within each of the two clusters).
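
For reference, here is a rough sketch of what I mean, applied to Method 1: the pairwise comparison is run separately inside each cluster, so memory grows with the size of the largest cluster instead of the whole dataset (the object name pairs_within_cluster is just a placeholder, and I use a plain sum of the distances instead of the scaled sum above):

library(dplyr)
library(purrr)
library(stringdist)

pairs_within_cluster <- my_data %>%
    mutate(row = row_number()) %>%
    group_split(cluster) %>%
    map_dfr(function(block) {
        block %>%
            # cross join within the block (dplyr may warn about a many-to-many join here)
            full_join(block, by = "cluster") %>%
            filter(row.x < row.y) %>%
            mutate(
                address.dist = stringdist(address.x, address.y),
                name.dist = stringdist(name.x, name.y)
            )
    }) %>%
    arrange(address.dist + name.dist)

On 100,000 rows this only helps if the clusters are reasonably small, since the work inside each cluster is still quadratic in its size.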

stats_noob
  • Could you try to geocode the addresses and then filter on common lat/long? – AndS. Nov 13 '22 at 14:13
  • @AndS. Thank you for your reply! I had considered this approach, but the addresses are in a messy format which would take a lot of time to standardize prior to geocoding. I thought that record linkage might be a better approach? Thanks! – stats_noob Nov 13 '22 at 18:03
  • Unless you control the data inflow, you might not be able to solve this: two completely different addresses might be closer in terms of string distance than two versions of the same address. E.g. `123 Fake Road Boston Drive Boston` is closer to `123 Lake Road Huston Drive Huston` than it is to `123 Fake - Rd Boston 56789`, yet the latter is the one you need to match. So there is no direct automation; try using `geocode` or even control the data collection. – Onyambu Nov 24 '22 at 05:03
  • I second AndS.'s and Onyambu's comments. Usually geocoding services are much more likely to resolve these ambiguities than an approach based on fuzzy matching. – nicola Nov 28 '22 at 10:49
  • I predicted this problem in my answer to your original question. Is there a specific reason why the answer falls short for your purposes? Why do you want to use distance instead of token similarity (which is much lighter on memory)? https://stackoverflow.com/a/74458813/5028841 – JBGruber Nov 30 '22 at 10:05

1 Answer


I encountered a similar problem when fuzzy matching records between two databases, and the stringdist_left_join function from fuzzyjoin helped me a lot, especially with the qgram metric!

I hope the example below can help you; if not, try adjusting the method and parameters noted in the comments. Here is the code:

library(dplyr)
library(tidyr)
library(stringdist)
library(fuzzyjoin)


address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345","address 2 1 fortest",
            "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789","address 2.1 for test","addres 2 1 for testt")

name = c("ABC Center Building", "Cent. Bldg ABC", "name 2 1 for test",
         "BD Home 25 New", "Boarding Direct 25","name 2.1 fortest","name 2.1 forr ttest")

cluster = c("A", "A","C",
            "B", "B","C","C")

my_data = data.frame(address, name, cluster)  
# > my_data
# address                name cluster
# 1    882 4N Road River NY, NY 12345 ABC Center Building       A
# 2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC       A
# 3               address 2 1 fortest   name 2 1 for test       C
# 4 123 Fake Road Boston Drive Boston      BD Home 25 New       B
# 5        123 Fake - Rd Boston 56789  Boarding Direct 25       B
# 6              address 2.1 for test    name 2.1 fortest       C
# 7              addres 2 1 for testt name 2.1 forr ttest       C  

my_data_1 = my_data %>%
   # ---add a column all_text that pastes address and name together, so the fuzzy match has more text to work with
   mutate(all_text = paste(address, name))
   

# Note: nrow(my_data_1) is evaluated once here, before any rows are removed inside the loop.
# Iterations where i points past the end of the shrinking data frame produce an all-NA df2
# that matches nothing, so they are wasted but harmless.
for (i in 1:nrow(my_data_1)) {
   df2 = my_data_1[i, ]
  
   mydata_2 = my_data_1[-c(i), ]
     
   my_data_dists = stringdist_left_join(df2, mydata_2, 
                          by = "all_text",
                          # ---you can change the method; details of the available methods are here: https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.6/topics/stringdist-metrics
                          method = "qgram", 
                          # ---you can adjust this parameter
                          q = 1:4,
                          max_dist = 30, 
                          distance_col = "dist") %>%
      # *****here you can filter further if you only want to remove the single most similar duplicate
      # group_by(all_text.x) %>%
      # top_n(1, -dist) %>%
      # ungroup() %>%
      
      #  --------
      # *****here you can filter further on another condition, e.g. a minimum number of characters; it helps a lot if there is a big variation in your texts!

      # filter(dist < 5 | (dist >= 5 &
      #                   nchar(all_text.x) > 18 &
      #                   nchar(all_text.y) > 18)) %>%
      #  --------
   
      select(c("address.y", "name.y", "cluster.y", "all_text.y"))
   
   # rename the columns of my_data_dists so they line up with my_data_1 (needed for the anti_join below)
   names(my_data_dists) = names(my_data_1)
      
   
   if (nrow(my_data_dists) >= 1)
   {
      #---------delete all detected duplicates and keep just the current row
      my_data_1 = anti_join(my_data_1, my_data_dists)
   }
  
}
result = my_data_1 %>% select(-all_text)
result
# > result
# address                name cluster
# 1    882 4N Road River NY, NY 12345 ABC Center Building       A
# 2               address 2 1 fortest   name 2 1 for test       C
# 3 123 Fake Road Boston Drive Boston      BD Home 25 New       B

Please let me know if more details are required.

islem
  • Note that the clusters are not taken into consideration in the comparison; they were only included for the example. – islem Nov 29 '22 at 18:32