
I have a single data frame with fewer than 5,000 rows (read from a CSV file). It has many columns, one of which is the company name. However, many companies appear more than once under slightly different names. For example, one company can be called: HH 785 EN

And its duplicates could be called: HH 785EN or HH784 EN

Each duplicate differs from the original company name by only 1 or 2 characters.

I'm looking for an algorithm that can detect these duplicates. Most of the fuzzy-matching problems I have seen involve two datasets, which isn't my case. I have also seen many algorithms that take one word and a list as input, but I want to compare my whole column of company names against itself.

Thanks for your help.

BeijaxBI
  • Please [see here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on posting an R question that we can answer, including a representative sample of data and the code you've written so far – camille Jul 18 '18 at 02:30

1 Answer


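I think you are looking for the agrep function, which does approximate matching based on the generalized Levenshtein edit distance. You can combine agrep with sapply to match the column against itself, using each company name in turn as the pattern.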

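# For each name, return the indices of all names that approximately match it
# (every name also matches itself)
sapply(df$company_name, agrep, df$company_name)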
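As an alternative sketch (not part of the original suggestion), base R's adist computes the full matrix of pairwise generalized Levenshtein distances, which makes it easy to list candidate duplicate pairs within one column. The sample vector and the threshold of 2 edits below are assumptions based on the examples in the question; in practice you would use df$company_name and tune the threshold.

```r
# Hypothetical sample data standing in for df$company_name
company_name <- c("HH 785 EN", "HH 785EN", "HH784 EN", "ACME Corp")

# Pairwise generalized Levenshtein distances between all names
d <- adist(company_name)

# Assumed threshold: names within 2 edits are treated as candidate duplicates.
# upper.tri() drops self-matches and mirrored (b, a) pairs.
pairs <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

# One row per candidate pair, with the edit distance between the two names
dupes <- data.frame(
  name_a   = company_name[pairs[, "row"]],
  name_b   = company_name[pairs[, "col"]],
  distance = d[pairs]
)
dupes
```

Unlike agrep, which also accepts approximate substring matches, adist compares whole strings, so a short name will not match inside a longer one. With fewer than 5,000 rows the full distance matrix should still fit in memory.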
MSW Data