I am trying to match a source set of customer names against a master set of customer names. For small inputs this can be done with `adist` in R, but I now have about 2 million source names and 500k master names, and `adist` cannot be used here because it does not support long vectors. So I split the data into smaller chunks; each chunk now has about 70k source names and 20k master names. The chunk sizes vary, though, and I still cannot use `adist`, since it does not seem to support sets of different sizes. I have also tried `amatch`, `pmatch` and `agrep`, but they were not much help. I have looked at the following questions but could not find a solution:
- Super fuzzy name checking?
- Faster R code for fuzzy name matching using agrep() for multiple patterns...?
- R: String Fuzzy Matching using jarowinkler
- Fuzzy string matching in r
I have also tried `levenshteinDist`, `levenshteinSim` and `jarowinkler`, but I have trouble applying them to a data frame this large. Is there a solution for my data frames, similar to this solution using `jarowinkler`, that works for sets of different sizes?