Extract similar words from a corpus

Question

I want to extract similar words from a corpus, where similarity is based on the strings themselves: when two words are spelled very similarly, they should be extracted as similar words. For example, if the corpus contains: Aras, bahro, arasis, adkpo, bah, aras sd, kio.

Similar words:

1- aras, arasis, aras sd

2- bahro, bah

How can I solve this problem? Thanks.

score 0 · Answer 1 · answered Aug 28 '14 at 08:24


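Levenshtein distance is a metric for measuring the difference between two strings: it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. You could compute the distance between each pair of words in the corpus and treat pairs with a small distance as similar.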
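
As a rough illustration of that idea, here is a minimal Python sketch. The function names (`levenshtein`, `similarity`, `group_similar`), the length-normalized score, the case-insensitive comparison, the 0.5 threshold, and the greedy "compare against the first word of each group" strategy are all assumptions made for this example. They are not part of the original answer, and a proper clustering method may suit a real corpus better.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions, substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def group_similar(words, threshold=0.5):
    """Greedy grouping (an assumption): a word joins the first group
    whose first member is at least `threshold` similar to it."""
    groups = []
    for word in words:
        for group in groups:
            if similarity(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    # Words with no similar partner are dropped.
    return [group for group in groups if len(group) > 1]


corpus = ["Aras", "bahro", "arasis", "adkpo", "bah", "aras sd", "kio"]
for group in group_similar(corpus):
    print(", ".join(group))
```

With the example corpus from the question and the assumed threshold of 0.5, this prints two groups: `Aras, arasis, aras sd` and `bahro, bah`. The result depends on the threshold and on the order of the words, so both would need tuning on real data.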

answered Aug 28 '14 at 08:24

salmuz


I would add a reference to something like this: http://stackoverflow.com/questions/10136470/unsupervised-clustering-with-unknown-number-of-clusters – Yasen Aug 28 '14 at 09:11
You can check my answer to a similar problem: http://stackoverflow.com/questions/24150440/unable-to-follow-the-intuition-behind-minimum-edit-distance/24151217#24151217 – Pierre Aug 28 '14 at 14:42
