2

I have a dataset which contains a field with individual's name. Some of the names are similar with minute differences like 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', 'Z_SANDSTONE COOLING LTD' and 'SANDSTONE COOLING LTD' etc. I need to identify such fuzzy duplicates and create a new subset containing these records.I have a huge table containing such records,so, I'm just producing a sample.

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| Antila,Thomas           | Georgia |
| Greg                    | Georgia |
| St.Luke's Hospital      | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital      | Georgia |
| CANON PVT. LTD.         | Georgia |
| SANDSTONE COOLING LTD   | Georgia |
| Greg                    | Georgia |
| ANTILA,THOMAS           | Georgia |

I want the output to be:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| CANON PVT. LTD.         | Georgia |
| Antila,Thomas           | Georgia |
| ANTILA,THOMAS           | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD   | Georgia |

I tried using RecordLinkage and agrep, but they give out the original data as output.

library(RecordLinkage)
ClosestMatch2 = function(string, stringVector){
  distance = levenshteinSim(string, stringVector);
  stringVector[distance == max(distance)]
  }
Fuzzy_duplicate=ClosestMatch2(df$Name, df$Name)

The other method was:

lapply(df$Name, agrep, df$Name, value = TRUE)

Using agrep gives the output as vector indices. However, I want to extract all the records belonging to only those whose names are similar?

Jazz
  • 125
  • 7
  • I would first do some data cleaning using the tm library and regex: E.g. Remove punctuation, make all letters upper/lower case, remove excessive whitespace, remove common company suffixes (e.g. LTD) and then match using a metric (levenshtein is an option, but I think there are more suitable ones). After that the subsetting should be rather easy. – Koot6133 Jul 12 '19 at 13:38
  • @Chase I went through a similar post https://stackoverflow.com/questions/6044112/how-to-measure-similarity-between-string. How can I extract all the records as a dataframe containing similar records instead of vector indices? – Jazz Jul 12 '19 at 17:26

3 Answers3

1

Using RecordLinkage::compare.dedup and enabling the string comparison with the strcmp argument gives good results in your case.

library(RecordLinkage)
library(tibble)

df <- tribble(~Name,~City,
  "CANON PVT. LTD"          , "Georgia" ,
  "Antila,Thomas"           , "Georgia" ,
  "Greg"                    , "Georgia" ,
  "St.Luke's Hospital"      , "Georgia" ,
  "Z_SANDSTONE COOLING LTD" , "Georgia" ,
  "St.Luke's Hospital"      , "Georgia" ,
  "CANON PVT. LTD."         , "Georgia" ,
  "SANDSTONE COOLING LTD"   , "Georgia" ,
  "Greg"                    , "Georgia" ,
  "ANTILA,THOMAS"           , "Georgia")
df_comp <- compare.dedup(df,exclude = c(2),strcmp = T)
weight_list <- epiWeights(df_comp) 
getPairs(weight_list,single.rows=T,min.weight=0.90)
id1 Name.1 City.1 id2 Name.2 City.2 Weight
3 3 Greg Georgia 9 Greg Georgia 1.0000000
4 4 St.Luke's Hospital Georgia 6 St.Luke's Hospital Georgia 1.0000000
1 1 CANON PVT. LTD Georgia 7 CANON PVT. LTD. Georgia 0.9866667
5 5 Z_SANDSTONE COOLING LTD Georgia 8 SANDSTONE COOLING LTD Georgia 0.9710145

I used a min.weight threshold of 0.9 but first false positive appear at 0.63 which gives a good margin.

0
If you know for certain that all records are duplicated, either perfectly or approximately (as in example df).

In this case getting just the approximate duplicates is easy -- you just get all the ones that aren't perfect duplicates. Below is an example using the dplyr::filter() and duplicated()

library(dplyr)
library(tibble)

d <- tibble::tribble(
  ~Name, ~ City,
  "CANON PVT. LTD"           , "Georgia",
  'Antila,Thomas'            , "Georgia",
  "Greg"                     , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "Z_SANDSTONE COOLING LTD"  , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "CANON PVT. LTD."          , "Georgia",
  "SANDSTONE COOLING LTD"    , "Georgia",
  "Greg"                     , "Georgia",
  "ANTILA,THOMAS"            , "Georgia"  
)

d %>% 
  filter(!Name %in% Name[duplicated(Name)]) #%>%
  # arrange(Name) #arrange together if you want

# A tibble: 6 × 2
  Name                    City   
  <chr>                   <chr>  
1 CANON PVT. LTD          Georgia
2 Antila,Thomas           Georgia
3 Z_SANDSTONE COOLING LTD Georgia
4 CANON PVT. LTD.         Georgia
5 SANDSTONE COOLING LTD   Georgia
6 ANTILA,THOMAS           Georgia
Assuming the harder case where you don't know for certain that all the records are duplicates (either exact matches or close matches), and you want to extract only the approximate duplicates.

In this case you've got to find some approximate string matching algorithm and apply each item iteratively to all the other items. I've used the stringdist package, cause I'm familiar with it.

Another thing to be careful of though, is that approximate string matching will also match with the perfect duplicates, which we don't want to include in our output, so we need to remove perfect duplicates while returning.

In the example below, dd includes some extra Names that are not duplicates at all (either approximate or perfect), to check that this works.

library(stringdist)

dd <- tibble::tribble(
  ~Name, ~ City,
  "CANON PVT. LTD"           , "Georgia",
  'Antila,Thomas'            , "Georgia",
  "Greg"                     , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "Z_SANDSTONE COOLING LTD"  , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "CANON PVT. LTD."          , "Georgia",
  "SANDSTONE COOLING LTD"    , "Georgia",
  "Greg"                     , "Georgia",
  "ANTILA,THOMAS"            , "Georgia",
  "MRS BOTHAM"               , "Georgia",
  "BIG COMPANY LTD"          , "Georgia"
)

#function to return indices of approximate matches
get_amatches <- function(x, method = "osa", maxDist = 3) {
  res <- lapply(x, \(y) {
    am <- stringdist::ain(tolower(x), tolower(y), method = method, maxDist = maxDist) #approx match (incl. perf match)
    pm <- x %in% y # is perfect match
    which(am & !pm) # index of am but not pm
  })
  unique(unlist(res))
}

#return approximate matches in df
dd %>% 
  slice(get_amatches(Name)) %>% 
  arrange(Name)

Which gives the following output:


# A tibble: 6 × 2
  Name                    City   
  <chr>                   <chr>  
1 Antila,Thomas           Georgia
2 ANTILA,THOMAS           Georgia
3 CANON PVT. LTD          Georgia
4 CANON PVT. LTD.         Georgia
5 SANDSTONE COOLING LTD   Georgia
6 Z_SANDSTONE COOLING LTD Georgia
Josh White
  • 1,003
  • 1
  • 17
0

For very heavy data, I would use something like this (but don't know how this solution behave compared to others solutions provided ) :

rm(list=ls())
gc()

#  packages needed
for (package in c('data.table', 'stringr', 'stringdist')) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

# Data
tbl.data <- data.table(name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital", 
                              "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", 
                              "CANON PVT. LTD.",
                              "SANDSTONE COOLING LTD",
                              "Greg", "ANTILA,THOMAS"))

# A string with just letters
tbl.data[, string_ := str_replace_all(tolower(str_replace_all(name, "[^[:graph:]]", "")),"[[:punct:][:digit:]]", "")]

# Create a table for comprisons ----
################################ -
tbl.a <- tbl.data[, .(string_1 = string_, names1 = name, one = 1, id1 = .I)]
tbl.b <- tbl.data[, .(string_2 = string_,  names2 =name, one = 1, id2 = .I)]
  # avoid unneeded lines :
tbl.a <- unique(tbl.a)
tbl.b <- unique(tbl.b)
  # the table of all couples :
tbl.dedoubl <- merge(tbl.a, tbl.b, by = "one",  allow.cartesian = TRUE)
  # avoid unneeded comparisons :
tbl.dedoubl <- tbl.dedoubl[id1 < id2]

# Distance calculations : ----
################################ -
tbl.dedoubl[, distance := stringdist(string_1, string_2, method = "lv")]

# Lines suspected to be doubles : ----
################################### -
tbl.dedoubl[distance <= 2]