I have a dataset that contains a field with individuals' names. Some of the names are nearly identical, with only minute differences, such as 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', or 'Z_SANDSTONE COOLING LTD' and 'SANDSTONE COOLING LTD'. I need to identify such fuzzy duplicates and create a new subset containing these records. The actual table is huge, so I'm only showing a sample.
| Name | City |
|-------------------------|:-------:|
| CANON PVT. LTD | Georgia |
| Antila,Thomas | Georgia |
| Greg | Georgia |
| St.Luke's Hospital | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital | Georgia |
| CANON PVT. LTD. | Georgia |
| SANDSTONE COOLING LTD | Georgia |
| Greg | Georgia |
| ANTILA,THOMAS | Georgia |
I want the output to be:
| Name | City |
|-------------------------|:-------:|
| CANON PVT. LTD | Georgia |
| CANON PVT. LTD. | Georgia |
| Antila,Thomas | Georgia |
| ANTILA,THOMAS | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD | Georgia |
I tried using RecordLinkage and agrep, but both give back the original data as output.
library(RecordLinkage)
ClosestMatch2 = function(string, stringVector){
distance = levenshteinSim(string, stringVector);
stringVector[distance == max(distance)]
}
Fuzzy_duplicate=ClosestMatch2(df$Name, df$Name)
The other method was:
lapply(df$Name, agrep, df$Name, value = TRUE)
Using agrep gives the output as vector indices. However, what I want is to extract only the records whose names are similar to some other name in the table. How can I do that?
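For reference, here is a minimal sketch of the kind of result I am after, using base R's `adist` (generalized Levenshtein distance) on the sample above. As far as I can tell, `levenshteinSim` compares two equal-length vectors element by element, so each name is compared only with itself, which may be why the first attempt returns everything. The threshold `max_dist` and the helper names `close`, `idx` and `group` are assumptions made up for illustration, and a "fuzzy duplicate" is assumed to mean a name that is not byte-for-byte identical to another name but lies within `max_dist` edits of it when case is ignored.

```r
# Sample data from the question
df <- data.frame(
  Name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
           "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", "CANON PVT. LTD.",
           "SANDSTONE COOLING LTD", "Greg", "ANTILA,THOMAS"),
  City = "Georgia",
  stringsAsFactors = FALSE
)

# Assumed threshold: at most 2 single-character edits between two names
max_dist <- 2

# Pairwise edit distances between all names, ignoring case
d <- adist(df$Name, ignore.case = TRUE)

# A pair counts as a fuzzy duplicate when the raw strings differ
# (this drops exact duplicates and the diagonal) but the distance is small
close <- d <= max_dist & outer(df$Name, df$Name, "!=")

# Rows that have at least one fuzzy partner
idx <- which(rowSums(close) > 0)

# Crude grouping key: the smallest row index among a row and its partners,
# used only to place similar names next to each other in the output
group <- sapply(idx, function(i) min(i, which(close[i, ])))

fuzzy_dupes <- df[idx[order(group)], ]
print(fuzzy_dupes)
```

This builds a full n x n distance matrix, so it would likely need blocking (for example, comparing only within the same City) before it could run on a huge table, and the grouping key is not a true clustering when matches chain across more than two names.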