My problem is as follows: I have a large Excel dataset with a company identifier in column A and long text strings in column B. Each cell in column B contains many words; think of it as a company description. I need to find fuzzy duplicates in this dataset. For example:

In cell B2 I have:

"Amazon.com Inc (Amazon) is an online retailer and web service provider. The company provides products such as apparel, auto and industrial items, beauty and health products, electronics, grocery, books, games, jewellery, kids and baby products, movies, music, sports goods, toys, tools and other related products."

and in cell B222 I have:

"%& COMPANY DESCRIPTION: Amazon is an online retailer and web service provider. Amazon provides products such as apparel, auto and industrial items, beauty and health products, electronics, grocery, books, games, jewellery, kids and baby products, movies, music, sports goods, toys, tools and other related products. Amazon is a great company."

So my question is: is there a fast way to find B222 and indicate in B2 that it has a fuzzy duplicate in B222, e.g. with an 80% match?

I have tried several tools, such as Ablebits and a Levenshtein distance implementation in VBA. However, I am not fully satisfied with the results.
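
To make the goal concrete, here is a minimal sketch of the kind of comparison I have in mind. It is in Python rather than VBA, purely for illustration, and it uses word-set (Jaccard) similarity instead of Levenshtein distance, which is only my assumption about what might suit long descriptions better. The names `tokenize`, `jaccard`, and `find_fuzzy_duplicates` are hypothetical helpers, not part of any tool mentioned above.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(words_a, words_b):
    """Word-set similarity: shared words divided by all distinct words."""
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_fuzzy_duplicates(rows, threshold=0.8):
    """Compare every pair of rows and report pairs at or above the threshold.

    rows: iterable of (cell_ref, text) pairs, e.g. ("B2", "Amazon.com Inc ...").
    Returns a dict mapping each cell_ref to a list of (other_ref, score).
    """
    # Tokenize each description once, then compare all pairs.
    tokenized = [(ref, tokenize(text)) for ref, text in rows]
    matches = {}
    for (ref_a, words_a), (ref_b, words_b) in combinations(tokenized, 2):
        score = jaccard(words_a, words_b)
        if score >= threshold:
            matches.setdefault(ref_a, []).append((ref_b, score))
            matches.setdefault(ref_b, []).append((ref_a, score))
    return matches
```

With the two example descriptions above, this word-set similarity comes out at roughly 0.85 by my count, so B222 would be reported against B2 at an 80% threshold. Comparing every pair is quadratic in the number of rows, so this naive version may be too slow for a very large sheet.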

Thank you for any help!

Best,

qrif
  • Does this answer your question? [Alternative to Levenshtein and Trigram](https://stackoverflow.com/questions/20162894/alternative-to-levenshtein-and-trigram) – user11222393 Jan 10 '23 at 11:51
  • Thank you. Seems to be closer to what I am looking for than what I have found so far! However, is there also an Excel VBA solution for this? – qrif Jan 10 '23 at 13:30
