
I have a dataset that looks something like this in R:

address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345",
            "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")

my_data = data.frame(address, name)

                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC
3 123 Fake Road Boston Drive Boston      BD Home 25 New
4        123 Fake - Rd Boston 56789  Boarding Direct 25

Looking at this data, it is clear that the first two rows refer to the same place, as do the last two rows. However, if you tried to remove duplicates directly, standard functions (e.g. "distinct()") would report that there are no duplicates in this dataset, seeing as every row has some unique element.
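For example, distinct() simply returns the data unchanged:

library(dplyr)

distinct(my_data)        # still 4 rows - no two rows are exact matches
nrow(distinct(my_data))
#> [1] 4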

I have been trying to research different methods in R that are able to de-duplicate rows based on "fuzzy conditions".

Based on the answers provided here (Techniques for finding near duplicate records), I came across a method called "Record Linkage". I then found this tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) that might be able to perform a similar task, but I am not sure if it is intended for the problem I am working on.

  • Can someone please help me confirm if this Record Linkage tutorial is in fact relevant to the problem I am working on - and if so, could someone please show me how to use it?

  • For example, I would like to remove duplicates based on the name and address - and only have two rows remaining (i.e. one row from row1/row2 and one row from row3/row4 - whichever one is kept doesn't really matter).

  • As another example - suppose I wanted to try this and only de-duplicate based on the "address" column: is this also possible?

Can someone please show me how this could work?

Thank you!

Note: I have heard of some options using SQL-style joins along with fuzzy joins (e.g. https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html) - but I am not sure if this option is also suitable.
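For what it's worth, my rough understanding is that a fuzzy self-join on the address column would look something like this (the max_dist = 15 cutoff is just a guess, and I am not sure this is the right way to use the package):

library(dplyr)
library(fuzzyjoin)

# fuzzy self-join: keep pairs of rows whose addresses are within 15 edits of each other
my_data %>%
  mutate(row = row_number()) %>%
  stringdist_inner_join(., ., by = "address", max_dist = 15, distance_col = "dist") %>%
  filter(row.x < row.y) %>%   # drop self-matches and mirrored pairs
  select(row.x, row.y, dist, address.x, address.y)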

stats_noob
    Yep, you're in record linkage territory. For addresses in particular, you may want to look into tokenising (i.e., parsing the string into lot numbers, street name, postcode etc), or even geo-coding each line and then using a standardised form. It's not at all a straight-forward task and there's no clear 'answer' to this that wouldn't just be rewriting a textbook. – thelatemail Nov 08 '22 at 23:45
  • @ thelatemail: thank you for your answer! I think you are right - there is likely no straightforward way to solve this problem. But I was wondering, perhaps there might be some relatively straightforward way that might not be perfect - but can still make significant progress on this problem. Could some of the record linkage approaches in the links I posted be applied on this problem? If you had time, could you please show me how to do this? Thank you so much! – stats_noob Nov 08 '22 at 23:57
  • Have a look at something like the slides here - https://rpubs.com/ahmademad/RecordLinkage - for a worked example. The problem you have with your data is that you need to get cleaned separate fields to even try to do linkage. Which means something like this: https://stackoverflow.com/questions/68029796/parse-address-strings-in-r will need to be done first. – thelatemail Nov 09 '22 at 00:19
  • Thanks again ... I am looking and learning more about this! – stats_noob Nov 09 '22 at 16:32
  • You could have a look at the `refinr` package, which implements some of the fuzzy matching algorithms used in OpenRefine (https://openrefine.org/). In my experience this is a difficult problem and always requires quite a lot of manual checking and correction, but OpenRefine is often a good place to start. – Andrew Gustar Nov 11 '22 at 09:52
  • Thank you! I wonder if the Levenshtein Distance (LD) could be calculated between different combinations of rows, and then, if the LD is smaller than some threshold, those two rows are treated as the same row ... and one of them is deleted? But if you have a dataset of 1000 rows, that would involve checking 499,500 comparisons? – stats_noob Nov 11 '22 at 17:14
  • If your address data is not completely random, you could try setting up a workflow using the (non-CRAN) `postmastr` package: https://slu-opengis.github.io/postmastr/articles/postmastr.html – Wimpel Nov 14 '22 at 12:14
  • An alternative solution might be to use an API, such as the one provided by Google Maps - you could geocode the results, effectively taking advantage of their address-parsing algorithms rather than creating your own. – Captain Hat Nov 14 '22 at 14:19
  • @ Captain Hat: Thanks for the reply! I was hoping to not have to pay money for this task ... but if this is the only option, I will look into it! – stats_noob Nov 14 '22 at 14:50
  • Right now - I am trying to see if I can solve the problem this way ... https://stackoverflow.com/questions/74427483/performing-record-linkage-in-r – stats_noob Nov 14 '22 at 14:50
  • It's not that simple to use distances. For example, how will you distinguish an address like `123 Lake Road Huston Drive Huston` from `123 Fake Road Boston Drive Boston`? Sorry, but unless you can control the data collection, this cannot be fully automated. Note that the distance between these two DIFFERENT addresses is much smaller than the distance between the two similar addresses given above. – Onyambu Nov 15 '22 at 00:40
  • With the Bing or Google API you can geolocate a fair amount of addresses for free IIRC (try ggmap::geocode), then you can fuzzy match the coordinates and either curate manually or apply some sanity checks. – moodymudskipper Nov 15 '22 at 07:04
  • I found this tutorial here for Google API https://rpubs.com/michaeldgarber/geocode . In general, I am very paranoid about providing credit card information seeing as I have been overcharged in the past and it caused me a lot of stress to resolve the situation. I would prefer approaches in which credit card information does not need to be provided. – stats_noob Nov 15 '22 at 07:55
  • Ah, I did this back in the day when no card was required. But I had better luck with bing in any case since they had a very high limit of calls. I believe I adapted this code, not sure if it works as is : https://stackoverflow.com/questions/23647605/geocoding-with-r-and-googlemaps-api#38965270 (or if bing wants your card too now) – moodymudskipper Nov 16 '22 at 19:50

3 Answers


For tasks like this, I like to use a divide-and-conquer strategy, as you quickly run into memory issues when comparing a large number of strings or longer strings.

packages

library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(stringdist)

phase 1: token similarity

I add an ID column and combine name and address into fulltext for comparison.

my_data2 <- my_data |>
  mutate(ID = factor(row_number()),
         fulltext = paste(name, address))

The quanteda approach to similarity is to divide strings into words/tokens before comparing which tokens two strings have in common. This is extremely efficient compared to string distance:

duplicates <- my_data2 |> 
  # a bunch of wrangling to create the quanteda dfm object
  corpus(docid_field = "ID",
         text_field = "fulltext") |> 
  tokens() |> 
  dfm() |> 
  # calculate similarity using cosine (other methods are available)
  textstat_simil(method = "cosine") |> 
  as_tibble() |>
  # attaching the original documents back to the output 
  left_join(my_data2, by = c("document1" = "ID")) |> 
  left_join(my_data2, by = c("document2" = "ID"), suffix = c("", "_comparison"))

duplicates |> 
  select(cosine, 
         address, address_comparison, 
         name, name_comparison)
#> # A tibble: 5 × 5
#>   cosine address                           address_comparison      name  name_…¹
#>    <dbl> <chr>                             <chr>                   <chr> <chr>  
#> 1 0.641  882 4N Road River NY, NY 12345    882 - River Road NY, Z… ABC … Cent. …
#> 2 0.0801 882 4N Road River NY, NY 12345    123 Fake Road Boston D… ABC … BD Hom…
#> 3 0.0833 882 - River Road NY, ZIP 12345    123 Fake Road Boston D… Cent… BD Hom…
#> 4 0.0962 882 - River Road NY, ZIP 12345    123 Fake - Rd Boston 5… Cent… Boardi…
#> 5 0.481  123 Fake Road Boston Drive Boston 123 Fake - Rd Boston 5… BD H… Boardi…
#> # … with abbreviated variable name ¹​name_comparison

As you can see, the first and second, as well as the third and fourth entries have a rather high similarity with 0.641 and 0.481 respectively. This comparison can already be enough to identify duplicates in most cases. However, it completely ignores word order. The classic example is that "Dog bites man" and "Man bites dog" have a token similarity of 100%, yet an entirely different meaning. Look into your dataset to figure out if the order of tokens plays a role or not. If you think it does, read on.
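A quick illustration with toy strings (not from the data):

# token-based cosine similarity ignores word order entirely
c(d1 = "Dog bites man", d2 = "Man bites dog") |>
  corpus() |>
  tokens() |>
  dfm() |>
  textstat_simil(method = "cosine")
#> d1 and d2 come out with a cosine similarity of 1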

phase 2: string similarity

String similarity as implemented in stringdist is a normalised version of the distance. For a raw distance, the length of the texts you compare plays no role: two 4-letter strings that differ in two letters are very dissimilar, while the same is not true for two 100-letter strings. Your example looks like this might not be a big issue, but in general, I prefer similarity for that reason.
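A small illustration of the difference with toy strings:

# identical raw distance, very different similarity once length is taken into account
stringdist("abcd", "abXY")                                  # 2
stringsim("abcd", "abXY")                                   # 0.5
stringsim(strrep("a", 100), paste0(strrep("a", 98), "XY"))  # 0.98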

The problem with string similarity and distance, however, is that they are computationally very costly. Even a couple of hundred short texts can quickly take up your entire memory. So what you can do is filter the results above and only calculate string similarity on the candidates which already look like they are duplicates:

duplicates_stringsim <- duplicates |> 
  filter(cosine > 0.4) |> 
  mutate(stringsim = stringsim(fulltext, fulltext_comparison, method = "lv"))

duplicates_stringsim |> 
  select(cosine, stringsim,
         address, address_comparison, 
         name, name_comparison)
#> # A tibble: 2 × 6
#>   cosine stringsim address                           address_com…¹ name  name_…²
#>    <dbl>     <dbl> <chr>                             <chr>         <chr> <chr>  
#> 1  0.641     0.48  882 4N Road River NY, NY 12345    882 - River … ABC … Cent. …
#> 2  0.481     0.354 123 Fake Road Boston Drive Boston 123 Fake - R… BD H… Boardi…
#> # … with abbreviated variable names ¹​address_comparison, ²​name_comparison

For comparison, the stringsim values for the three comparisons we have already eliminated are 0.2, 0.208 and 0.133. Even though the gap is a little smaller, the string similarities confirm the results from phase 1.

Now the final step is to remove the duplicates from the original data.frame. For this I use another filter, pull out the IDs from the duplicates_stringsim object and then remove these duplicates from the data.

dup_ids <- duplicates_stringsim |> 
  filter(stringsim > 0.3) |> 
  pull(document2)


my_data2 |> 
  filter(!ID %in% dup_ids)
#>                             address                name ID
#> 1    882 4N Road River NY, NY 12345 ABC Center Building  1
#> 2 123 Fake Road Boston Drive Boston      BD Home 25 New  3
#>                                             fulltext
#> 1 ABC Center Building 882 4N Road River NY, NY 12345
#> 2   BD Home 25 New 123 Fake Road Boston Drive Boston

Created on 2022-11-16 with reprex v2.0.2

Note that I chose the cutoff values based on your requirements for the example. You will have to fine tune these for your dataset and likely all new projects.

JBGruber
  • @ JBGruber: thank you so much for your answer! Just a question - suppose I only had 1 column of data ... or suppose I had 4 columns of data .... in theory, could the same code you have provided still work? Thank you so much! – stats_noob Nov 16 '22 at 14:30
  • Yes, totally. Look at this line `fulltext = paste(name, address)`. It combines the two relevant columns into one. You can also just use one or 4 or 20 columns here. – JBGruber Nov 16 '22 at 14:31
  • Thank you so much! When I get home I will try this! – stats_noob Nov 16 '22 at 16:42
  • I am working on a related problem here, but this time I am trying to use a Record Linkage approach: https://stackoverflow.com/questions/74427483/performing-record-linkage-in-r . Do you have any ideas about this? Thank you so much! – stats_noob Nov 16 '22 at 16:43
  • It looks like you already got a good answer over there. – JBGruber Nov 18 '22 at 14:35

stringdist::stringdist() can be useful for finding near-duplicates, at least in relatively simple cases.

With your example data, we can perform a cartesian self-join to get all combinations of rows; use stringdist::stringdist() to compute distances* for all row-pairs for address and name; and arrange with most similar row-pairs first:

library(dplyr)
library(tidyr)
library(stringdist)

my_data_dists <- my_data %>% 
  mutate(row = row_number()) %>% 
  full_join(., ., by = character()) %>% 
  filter(row.x < row.y) %>% 
  mutate(
    address.dist = stringdist(address.x, address.y),
    name.dist = stringdist(name.x, name.y)
  ) %>% 
  arrange(scale(address.dist) + scale(name.dist)) %>% 
  relocate(
    row.x, row.y,
    address.dist, name.dist,
    address.x, address.y, 
    name.x, name.y
  )
  row.x row.y address.dist name.dist                         address.x                         address.y              name.x             name.y
1     1     2           13        13    882 4N Road River NY, NY 12345    882 - River Road NY, ZIP 12345 ABC Center Building     Cent. Bldg ABC
2     3     4           15        16 123 Fake Road Boston Drive Boston        123 Fake - Rd Boston 56789      BD Home 25 New Boarding Direct 25
3     2     3           25        13    882 - River Road NY, ZIP 12345 123 Fake Road Boston Drive Boston      Cent. Bldg ABC     BD Home 25 New
4     1     3           25        15    882 4N Road River NY, NY 12345 123 Fake Road Boston Drive Boston ABC Center Building     BD Home 25 New
5     2     4           23        17    882 - River Road NY, ZIP 12345        123 Fake - Rd Boston 56789      Cent. Bldg ABC Boarding Direct 25
6     1     4           25        18    882 4N Road River NY, NY 12345        123 Fake - Rd Boston 56789 ABC Center Building Boarding Direct 25

From here, you can manually weed out duplicates, or eyeball the results to choose a distance threshold to consider rows "duplicates." If we take the latter approach: it looks like name.dist may not be a reliable metric (e.g., one of the lowest values is a false positive), but address.dist scores below 20 seem reliable. You can then apply this to filter your original data.

dupes <- my_data_dists$row.y[my_data_dists$address.dist < 20]

my_data[-dupes,]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New

For more complex cases (e.g., more columns, very large datasets), you're likely better off with RecordLinkage or some of the other suggestions in the comments. But I've found stringdist flexible and helpful for cases involving just a few columns.
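If you do try RecordLinkage, a minimal de-duplication sketch might look roughly like the following (untested here; the 0.7 threshold is just a starting point that you'd need to tune):

library(RecordLinkage)

# compare every pair of rows, with Jaro-Winkler string comparison on each field
rpairs <- compare.dedup(my_data, strcmp = TRUE)
rpairs <- epiWeights(rpairs)                          # EpiLink weights
result <- epiClassify(rpairs, threshold.upper = 0.7)

# pairs classified as links ("L") are treated as duplicates; drop the later row of each pair
links <- result$pairs[result$prediction == "L", c("id1", "id2")]
my_data[!seq_len(nrow(my_data)) %in% links$id2, ]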

Edit: An alternative interface is provided by stringdist::stringdistmatrix() or utils::adist(), which return a dist object or matrix of distances among elements of one or two vectors:

stringdistmatrix(my_data$name)
#    1  2  3
# 2 13      
# 3 15 13   
# 4 18 17 16

adist(my_data$name)
#      [,1] [,2] [,3] [,4]
# [1,]    0   13   15   18
# [2,]   13    0   13   17
# [3,]   15   13    0   16
# [4,]   18   17   16    0

Edit 2: I've added some more information in response to OP's questions in a gist.


* stringdist functions use optimal string alignment by default, but other metrics can be specified in the method argument.
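For instance, switching to Jaro-Winkler distance is just:

stringdist("ABC Center Building", "Cent. Bldg ABC", method = "jw")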

zephryl
  • @ zephryl: a question I had - suppose I only had the "name" column. Would it be possible to adjust your code so that it only performs this de-duplication on the "name" column? Thank you so much! – stats_noob Nov 12 '22 at 06:54
  • You would use the same code, just remove `address.dist` from the `mutate()` call, and use `my_data_dists$name.dist` with an appropriate threshold instead of `my_data_dists$address.dist` to get the `dupes` indices… but I have a feeling I’m misunderstanding your question? – zephryl Nov 12 '22 at 13:22
  • I’ve edited my answer to mention alternative functions that return a matrix of distances, which may be useful for the single-vector case. I’d also look through the [other functions](https://www.rdocumentation.org/packages/stringdist/versions/0.9.10) available in stringdist. – zephryl Nov 12 '22 at 14:45
  • Thank you so much for your update! I am reading the documentation and trying to understand the differences between the stringdist() and the adist() functions. Do you have a preference out of these? – stats_noob Nov 12 '22 at 17:06
  • For example, in the outputs generated by the adist function - for any given row in the matrix, if a column value is less than some arbitrary value "x", is it possible to delete that column? – stats_noob Nov 12 '22 at 17:11
  • My response is a little long-winded so I put it in a [gist](https://gist.github.com/ccsarapas/081ad03b4f0104450d2e9fc9c306121f). As for preference, I prefer the stringdist package because it has a range of useful functions for approximate matching, whereas I think base R is limited to `base::agrep()` and `utils::adist()`. stringdist also implements ~10 different distance metrics, while base only does Levenshtein. – zephryl Nov 12 '22 at 19:38
  • @ Zephryl: Regarding your original answer - I tried running your code and got the following message: Error: cannot allocate vector of size 237.6 Gb – stats_noob Nov 13 '22 at 06:51
  • Is there any way around this error? Is this problem more suitable for Record Linkage instead? – stats_noob Nov 13 '22 at 06:52
  • Probably. RecordLinkage or possibly [refinr](https://cran.r-project.org/web/packages/refinr/vignettes/refinr-vignette.html). If nothing else, those will be better optimized for this kind of thing than my ad hoc code. – zephryl Nov 13 '22 at 10:33
  • Maybe look at [this](https://stackoverflow.com/questions/48058104/efficient-string-similarity-grouping/48096986#comment83197508_48096986) – MrSmithGoesToWashington Nov 14 '22 at 09:35
  • @MrSmithGoesToWashington: Thank you for your reply! If you have time, could you please show me an example of how this function "alphagrep" can be used in my example? Thank you so much! – stats_noob Nov 14 '22 at 14:52
  • @ zephryl: I posted a related question over here : https://stackoverflow.com/questions/74427483/performing-record-linkage-in-r - I tried some of the record linkage options but they dont seem to be working. if you have time, could you please take a look at it later? thank you so much! – stats_noob Nov 14 '22 at 14:53
  • @antonoyaro8 did you see [this question](https://stackoverflow.com/q/57007402/17303805)? There’s an answer there using RecordLinkage. – zephryl Nov 15 '22 at 03:36

Using agrep with a setting of max.distance = list(all = 0.6) (60%) gives good results both when using address and name together and when using only address. Results may vary slightly on larger data sets.

agrep
Approximate String Matching (Fuzzy Matching)

  • max.distance: Maximum distance allowed for a match.
    ‘all’: maximal number/fraction of all transformations (insertions, deletions and substitutions)

Filter unique entries, keeping the first one in each matched group (this can be adjusted to keep, e.g., the longest entry instead; see the variant sketched after the code below).

my_data[unique(sapply(paste(my_data$address, my_data$name), function(x)
  agrep(x, paste(my_data$address, my_data$name), 
    max.distance = list(all = 0.6))[1])),]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New

Using only address

my_data[unique(sapply(my_data$address, function(x) 
  agrep(x, my_data$address, max.distance = list(all = 0.6))[1])),]
                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
3 123 Fake Road Boston Drive Boston      BD Home 25 New
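A possible (untested) variant that keeps the longest entry of each matched group instead of the first:

# group rows with agrep as above, then keep the longest combined string per group
full <- paste(my_data$address, my_data$name)
grp  <- sapply(full, function(x) agrep(x, full, max.distance = list(all = 0.6))[1])
keep <- tapply(seq_along(full), grp, function(i) i[which.max(nchar(full[i]))])
my_data[keep, ]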
Andre Wildberg