How to use edit_distance() from nltk.metrics in this example?

Question

I have a bit of a problem using edit_distance() in the following example. I need to print words from the languages in the languages list in 5 columns, which is not a problem. I have done that:

from nltk.corpus import swadesh
from nltk.metrics import *
from transliterate import translit
languages = ['be', 'bg', 'bs', 'ru', 'cs']

for lang in languages:
    print('{:10}'.format(lang),end='')
print()
for i in range(len(swadesh.words('be'))):
    for lang in languages:
        print('{:10}'.format(swadesh.words(lang)[i].split(',')[0]),end='')
    print()

This part works as it is supposed to. Now I need to measure the Levenshtein string edit distance between each word from the 'be' language and its equivalent in the other languages. The distance should appear in brackets after each word, so it should look like this, for example:

tamto(0) acela(5) oni(5) то(3)

What would you suggest as the best way to measure it? I was thinking about creating dictionaries:

d = {}  # maps language code -> list of Swadesh words
for i in languages:
    words = swadesh.words(i)
    d[i] = words
print(d)

And then calculate the edit distance somehow, but I cannot get this to work, especially because one of the languages, Russian, uses a different script, which means that I have to use translit (correct me if I am wrong, this is what I found online). Do you have any tips on how to go about it? I am new to programming, so maybe it is a simple question for you, but I am still trying to find my way around nltk. Thank you in advance!

score 0 · Answer 1 · answered Jun 08 '20 at 16:03

First of all, I really suggest using the googletrans module, an unofficial client for Google Translate. You can install it via pip by running:

pip install googletrans

Now, let's get into the code, which is actually pretty straightforward:

import nltk
from nltk.corpus import swadesh
from googletrans import Translator


translator = Translator()
languages = ['be', 'bg', 'bs', 'ru', 'cs']

for word in swadesh.words('be'):
    for lang in languages[1:]:
        translated = translator.translate(word, src="be", dest=lang).text
        lev_dist = nltk.edit_distance(word, translated)
        print(f"Language: {lang}, Word: {word}, Translation: {translated}, Distance: {lev_dist}")

#Language: bg, Word: я, Translation: аз, Distance: 2
#Language: bs, Word: я, Translation: ja, Distance: 2
#Language: ru, Word: я, Translation: я, Distance: 0
#Language: cs, Word: я, Translation: já, Distance: 2
#...
#...

Thank you very much! I was thinking about this solution and using googletrans, but this is part of a larger task for an NLP course, so I don't know if it will be accepted, because we were supposed to use "swadesh.words()". Thank you anyway, I can see that yours is a better solution overall (so for my own future "projects" it will be extremely useful) - White, Jun 08 '20 at 16:15
You should know that `googletrans` isn't the official way to interact with the Google Translate API. So, if this is a suitable solution for you, consider paying for the official Google API, as this will fail after a few tries due to the restrictions of the free version - Anwarvic, Jun 08 '20 at 16:21
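
For the output format asked for in the question (the distance in brackets after each word, using only swadesh.words()), a minimal sketch could look like the following. The names `levenshtein`, `format_row` and `width` are made up for illustration and are not part of nltk. The sketch uses `nltk.metrics.edit_distance` when nltk is importable and otherwise falls back to a small pure-Python Levenshtein function, which should give the same result as `edit_distance` with its default arguments. It does no transliteration, so comparing a Cyrillic word with a Latin one counts every character as different.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance.

    Insertions, deletions and substitutions all cost 1.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


try:
    from nltk.metrics import edit_distance
except ImportError:
    # Assumption: nltk is not installed, so use the stand-in above.
    edit_distance = levenshtein


def format_row(words, width=14):
    """Return one output row; the first word is the reference ('be' here)."""
    reference = words[0]
    cells = ['{}({})'.format(word, edit_distance(reference, word))
             for word in words]
    return ''.join('{:{}}'.format(cell, width) for cell in cells)


# One row built from the words shown in the answer's sample output
# (be, bg, bs, ru); the distances come out as 0, 2, 2, 0.
print(format_row(['я', 'аз', 'ja', 'я']))

# With the corpus from the question (needs the swadesh corpus downloaded):
# from nltk.corpus import swadesh
# languages = ['be', 'bg', 'bs', 'ru', 'cs']
# columns = [swadesh.words(lang) for lang in languages]
# for row in zip(*columns):
#     print(format_row([word.split(',')[0] for word in row]))
```

Putting the 'be' word first in each row means its own distance is always 0, which matches the first column of the example in the question. If the Cyrillic entries should be compared with the Latin ones more fairly, each word could be transliterated before it is passed to `format_row`, but that step is left out here.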
