I want to extract similar words from a corpus, where similarity is based on the strings themselves: when the spellings of two words are highly similar, the two words should be extracted together as similar words. For example, suppose the corpus contains: Aras, bahro, arasis, adkpo, bah, aras sd, kio.

Similar words:

1- aras, arasis, aras sd

2- bahro, bah

How can I solve this problem? Thanks.

SahelSoft

1 Answer

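Levenshtein distance is a metric for measuring the difference between two sequences of characters: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Perhaps you can compute this distance for each pair of words and treat pairs with a small distance as similar.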
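As a rough illustration, here is a minimal Python sketch of that idea applied to the words from the question. The function names (`levenshtein`, `similarity`, `group_similar`), the case-insensitive comparison, the length-normalized score, and the 0.5 threshold are all assumptions made for this example. The greedy grouping is only a simple stand-in for a real clustering method, and its result can depend on the order of the words.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score in [0, 1]: 1 minus the distance divided by the longer length.
    Comparison is case-insensitive (an assumption, so 'Aras' matches 'aras')."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def group_similar(words, threshold=0.5):
    """Greedy grouping: put each word into the first group that already
    contains a sufficiently similar member, else start a new group."""
    groups = []
    for word in words:
        for group in groups:
            if any(similarity(word, member) >= threshold for member in group):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


corpus = ["Aras", "bahro", "arasis", "adkpo", "bah", "aras sd", "kio"]
# Keep only groups with more than one member, as in the question.
similar = [g for g in group_similar(corpus) if len(g) > 1]
print(similar)  # [['Aras', 'arasis', 'aras sd'], ['bahro', 'bah']]
```

The threshold would need tuning on a real corpus. Comparing every pair of words is quadratic in the vocabulary size, so a large corpus may need some pre-filtering (for example by word length) before computing distances.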

salmuz
  • I would add a reference to something like this: http://stackoverflow.com/questions/10136470/unsupervised-clustering-with-unknown-number-of-clusters – Yasen Aug 28 '14 at 09:11
  • You can check my answer on a similar problem: http://stackoverflow.com/questions/24150440/unable-to-follow-the-intuition-behind-minimum-edit-distance/24151217#24151217 – Pierre Aug 28 '14 at 14:42