
I'm not sure whether this question is a duplicate, but I would like to know more about an optimized Levenshtein distance implementation in R, Java, or Python. I have a text file containing numerous strings, one per line (close to 2,000 records, as shown below), in alphabetical order, some of which may be similar to each other. I want to compare every pair of strings in the file and output the distance matrix. Also, please let me know how to use this matrix to filter the set of strings based on a requirement, say LD <= 2.

Get back to me if the question is not clear and you need more information.

Sample Text File
----------------
abc
abcd
abe
bac
bad
back
blade
cub
cube
cute
dump
duke
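Since the question asks for a full pairwise distance matrix and an LD <= 2 filter, here is a minimal sketch in Python (one of the languages the question names), using the classic dynamic-programming Levenshtein algorithm and the sample words above. The function and variable names are my own, not from any library:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# The sample words from the question; in practice, read them from the file.
words = ["abc", "abcd", "abe", "bac", "bad", "back",
         "blade", "cub", "cube", "cute", "dump", "duke"]

# Full pairwise distance matrix.
matrix = [[levenshtein(w1, w2) for w2 in words] for w1 in words]

# Filter: distinct pairs with distance <= 2.
close_pairs = [(words[i], words[j])
               for i in range(len(words))
               for j in range(i + 1, len(words))
               if matrix[i][j] <= 2]
```

Note that the naive all-pairs comparison is O(n^2) distance calls; for ~2,000 records that is about two million calls, which is feasible but slow in pure Python, so for larger inputs the neighbor-generation approach in the answer below (or R's `adist`/`stringdist`, mentioned in the comments) scales better.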
maddy
  • You could start by looking at `?adist`. Is `adist(c("abc", "abcd", "abe", "bac", "bad", "back", "blade", "cub", "cube", "cute", "dump", "duke"))` what you are looking for? – Jota Mar 15 '14 at 06:50
  • http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/ – Marc in the box Mar 15 '14 at 07:56
  • Does this help? http://stackoverflow.com/questions/21511801/text-clustering-with-levenshtein-distances/21513338#21513338 – jlhoward Mar 15 '14 at 14:34

1 Answer


So this can be done in a slightly reversed manner. Create your dictionary `d = {word: [] for word in file}`. Now:

for word in d:
    for neighbor in edit_distance_1(word):
        if neighbor in d:
            d[word].append(neighbor)

Now `d` will be a graph mapping every word to its edit-distance-1 neighbors. You can trace these edges further to reach edit-distance-2 words (via intermediate words), which I believe is what you want.
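Spelled out as runnable code, it might look like the sketch below. The answer leaves `edit_distance_1` undefined, so the version here is an assumption: it follows Peter Norvig's well-known `edits1` construction (inserts, deletes, substitutions over lowercase letters), and the word list stands in for reading the file:

```python
import string

def edit_distance_1(word):
    """All strings one edit (insert, delete, substitute) away from word.
    Hypothetical helper -- the answer leaves it undefined; this follows
    Norvig's edits1 construction, without transpositions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return (deletes | substitutes | inserts) - {word}

# The sample words from the question; in practice, read them from the file.
words = ["abc", "abcd", "abe", "bac", "bad", "back",
         "blade", "cub", "cube", "cute", "dump", "duke"]
d = {word: [] for word in words}

# Link each word to every dictionary word one edit away.
for word in d:
    for neighbor in edit_distance_1(word):
        if neighbor in d:
            d[word].append(neighbor)
```

This does O(len(word) * 26) candidate lookups per word instead of comparing all pairs, which is the point of the "reversed" approach: it stays cheap even for a few thousand records, as long as the words are short.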

U2EF1