1

I have two long lists, one with English words and the other with their Spanish translations from Google Translate. The order corresponds exactly, e.g.

english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

I need to get all the word pairs from the two lists that are more or less similar in spelling.

I first thought about checking for similar letters at the beginning of the two words, but then realized that in some cases similar words have slightly different beginnings (such as "prejudicial" and "perjudicial").

The desired output is a table with two columns under the headings "English" and "Spanish" that contains the similar words but excludes those that look different:

English       Spanish

prejudicial   perjudicial
malignant     maligno
ratify        ratificar

John Aiton

2 Answers

1

First, install: pip install -U python-Levenshtein

Then:

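import Levenshtein

# english_list and spanish_list are the two parallel lists from the question
for a, b in zip(english_list, spanish_list):
    if Levenshtein.distance(a, b) < 3:  # close enough; tune the threshold for your data
        print('similar words:', a, b)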

Here's an explanation of how Levenshtein distance works: https://en.wikipedia.org/wiki/Levenshtein_distance

If you prefer a different similarity metric, you can use that instead, but this one is quite good and has worked well for me in the past.
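To make the metric concrete, here is a minimal pure-Python sketch of the edit distance itself. It is not the python-Levenshtein implementation, only the textbook dynamic-programming version; the function name `levenshtein` is made up for illustration, and it assumes plain single-character insertions, deletions and substitutions, each costing 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b (illustrative sketch)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

for eng, spa in zip(english_list, spanish_list):
    print(eng, spa, levenshtein(eng, spa))
```

On the sample data the three similar-looking pairs come out at distances 2, 3 and 4, so a fixed cutoff such as `< 3` keeps only "prejudicial"/"perjudicial". Longer words tolerate more edits, which is one reason to look at a length-normalized ratio as well.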

The Levenshtein module can also calculate a similarity ratio(...):

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.
lenik
0

You could use difflib and check the similarity ratio of each pair, like this:

$ cat similar.py

from difflib import SequenceMatcher

english_list = ['prejudicial','dire','malignant','appalling', 'ratify']
spanish_list =['perjudicial', 'grave', 'maligno', 'atroz','ratificar']

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


print('English', 'Spanish')
for eng, span in zip(english_list, spanish_list):
    if similarity(eng, span) >= 0.5:
        print(eng, span)

Output:

$ python3 similar.py
English Spanish
prejudicial perjudicial
malignant maligno
ratify ratificar
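If the aligned two-column layout from the question is wanted, the filtered pairs can be padded to a common width. The following is only a sketch: the helper names `similar_pairs` and `format_table` are hypothetical, and the 0.5 threshold is simply the one used above, not a recommended value.

```python
from difflib import SequenceMatcher


def similar_pairs(english, spanish, threshold=0.5):
    """Keep the (english, spanish) pairs whose difflib ratio reaches the threshold."""
    return [
        (eng, spa)
        for eng, spa in zip(english, spanish)
        if SequenceMatcher(None, eng, spa).ratio() >= threshold
    ]


def format_table(pairs, headers=("English", "Spanish")):
    """Return the pairs as text with a left-aligned, padded first column."""
    # Width of the first column: longest entry (header included) plus a 2-space gap.
    width = max(len(row[0]) for row in [headers, *pairs]) + 2
    rows = [headers, *pairs]
    return "\n".join(f"{left:<{width}}{right}" for left, right in rows)


english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

pairs = similar_pairs(english_list, spanish_list)
table = format_table(pairs)
print(table)
```

For very long lists, the same pairs could instead be written out with the `csv` module so that they open as a table in a spreadsheet.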

As a side note, depending on your use case, you should compare difflib with Levenshtein.

han solo