
I am using a dataset of company names that may contain non-identical duplicates.

The list may contain "company A", but also "c.o.m.p.a.n.y A" or "comp A".

Is there a Python script, using NLP for example, that can find similar names in a dataset?

Thanks in advance

Amine
  • I guess you would have to train another NLP network to preprocess the data for another network. In some cases, where there is something like 'c.o.m.p.a.n.y', you can just remove the useless characters and leave only the letters – Dmitry Barsukoff Apr 13 '22 at 21:33
  • Do you know the form of the possible duplicates? – OTheDev Apr 13 '22 at 22:12
  • Yes I do know the general form of duplicates but not all of them – Amine Apr 13 '22 at 22:16
  • Maybe these three links will help you: [link1](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings), [link2](https://stackoverflow.com/questions/55162668/calculate-similarity-between-list-of-words), [link3](https://stackoverflow.com/questions/66919407/calculating-words-similarity-score-in-python) – I'mahdi Apr 13 '22 at 22:57

1 Answer


You can use spaCy to get the similarity between two texts.

import spacy

nlp = spacy.load("en_core_web_md")  # use a model with word vectors (md or lg), not the small one
doc1 = nlp("Coca-Cola")
doc2 = nlp("Pepsi")

doc3 = nlp("Company Coca-Cola")
doc4 = nlp("Company Pepsi-Cola")


print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc3, "<->", doc4, doc3.similarity(doc4))

This prints the following similarities:

Coca-Cola <-> Pepsi 0.6684898494102074
Company Coca-Cola <-> Company Pepsi-Cola 0.934960639746236
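
Note that these scores come from word vectors, so they measure how related the meanings are rather than how alike the spellings are, which is why two different companies can still score above 0.9. For spelling variants such as "c.o.m.p.a.n.y A" or "comp A", a plain string comparison after some cleanup may be a useful complement. The following is only a rough sketch using the standard library's difflib; the helper names `normalize` and `similar_pairs`, the cleanup rule, and the 0.7 threshold are assumptions chosen for illustration, not something taken from the question's data.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase a name and drop everything except letters, digits and spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    # Collapse any repeated whitespace left over after the cleanup.
    return " ".join(cleaned.split())


def similar_pairs(names, threshold=0.8):
    """Return (name_a, name_b, score) for pairs whose normalized forms are close."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical sample data based on the examples in the question.
names = ["company A", "c.o.m.p.a.n.y A", "comp A", "Pepsi"]
for a, b, score in similar_pairs(names, threshold=0.7):
    print(a, "<->", b, score)
```

This compares every pair of names, so it is quadratic in the size of the list; for a large dataset you would probably want to group candidates first (for example by first letter or by normalized prefix) and tune the threshold on real examples.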
PleSo