fuzzy matching in one dataframe in R

Question

I have a single data frame with fewer than 5,000 rows (read from a CSV file). It has many columns, one of which is the company name. However, there are many duplicates under slightly different names. For example, one company can be called: HH 785 EN

And its duplicate could be called: HH 785EN or HH784 EN

Every duplicate differs from the original company name by only 1 or 2 characters.

I'm looking for an algorithm that could detect these duplicates. Most of the fuzzy matching problems I have seen involve 2 datasets, which isn't my case. I have seen many algorithms that take one word and a list as input, but I want to check my whole column of company names against itself.

Thanks for your help.

Please [see here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on posting an R question that we can answer, including a representative sample of data and the code you've written so far — camille, Jul 18 '18 at 02:30

MSW Data · Answer 1 · 2018-07-17T12:46:17.897

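I think you are looking for the agrep function, which does approximate matching based on the generalized Levenshtein edit distance. You can combine agrep with sapply to match every name against the whole column. For each name, this returns the indices of the rows that approximately match it (including the name itself):

sapply(df$company_name, agrep, df$company_name)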
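If an explicit list of candidate duplicate pairs is easier to work with, one alternative is base R's adist, which computes the full matrix of pairwise edit distances for a single vector. The following is only a minimal sketch under stated assumptions: the sample names are taken from the question plus one made-up unrelated name, the column name company_name follows the code above, and the cutoff max_edits is a hypothetical parameter set to 2 because the question says duplicates differ by 1 or 2 characters.

```r
# Hypothetical sample data: names from the question plus one unrelated name.
df <- data.frame(
  company_name = c("HH 785 EN", "HH 785EN", "HH784 EN", "ACME CORP"),
  stringsAsFactors = FALSE
)

# Pairwise generalized Levenshtein distances between all names (n x n matrix).
d <- adist(df$company_name)

# Assumed cutoff: treat names within 2 edits of each other as candidates.
max_edits <- 2

# upper.tri() keeps each unordered pair once and drops the diagonal
# (every name is at distance 0 from itself).
close <- which(d <= max_edits & upper.tri(d), arr.ind = TRUE)

# One row per candidate pair, with the edit distance between the two names.
pairs <- data.frame(
  name_a   = df$company_name[close[, "row"]],
  name_b   = df$company_name[close[, "col"]],
  distance = d[close],
  stringsAsFactors = FALSE
)

print(pairs)
```

Note that adist compares whole strings, whereas agrep looks for an approximate match of the pattern inside each string and controls its tolerance through the max.distance argument, so the two approaches can flag different pairs. With fewer than 5,000 rows the full distance matrix should still fit in memory, but it grows quadratically with the number of names.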

edited Jul 17 '18 at 12:46

answered Jul 17 '18 at 12:36
