
I have two .csv files. One contains new data and has ~100 rows; the other is a reference book with ~40k rows.

I want to compare each string from the first file, one by one, to all strings from the second file and find the most similar string by Levenshtein distance.

After that I need to create a third file with the results: all data from the first file, the score of the closest match, and that matching string from the second file.

For example:

File A (new data):

Spam
Foo

File B (reference book):

Bar 1 0
Spamm 2 1
Spann 3 0
Booo 1 0
Fooo 2 2
Bo 3 3
...

What I need (result file), where n = the Levenshtein distance:

Spam n Spamm
Foo n Fooo

My code so far is:

import csv

def calculate_Leven(source, ref, result_file):
    # `distance` is assumed to be defined elsewhere (see the note below the code).
    with open(source, 'r') as input1, \
         open(ref, 'r') as input2, \
         open(result_file, 'w') as csvoutput:
        reader1 = csv.reader(input1)
        ref_rows = list(csv.reader(input2))  # read the ~40k reference rows once
        writer = csv.writer(csvoutput)
        result = []
        headers = next(reader1)
        result.append(headers)
        for row1 in reader1:
            best_score = 0  # best similarity seen so far (don't shadow the built-in max)
            SKU = ''
            for row2 in ref_rows:
                a = distance(row1[0], row2[0])
                b = 1 - a / float(len(row1[0]))  # force float division, also on Python 2
                if b > best_score:
                    best_score = b
                    SKU = row2[0]
            row1.append(best_score)
            row1.append(SKU)
            result.append(row1)
        writer.writerows(result)

Here distance is a function that calculates the Levenshtein distance.

This code works but is extremely slow. Is there a better way to structure it, or an alternative approach that is more efficient? I have about 100 new files per day to check against the reference book, so the low speed is a real bottleneck.

Vasily Bronsky
  • Related: https://stackoverflow.com/questions/4173579/implementing-levenshtein-distance-in-python – John Y Sep 20 '17 at 13:19
  • I don't think the problem is in the Levenshtein algorithm. It works fine and gives correct results. I checked it on small data, for example 100 rows per table; the script runs in about 3 seconds. – Vasily Bronsky Sep 20 '17 at 13:30
  • Multiple processes with the `concurrent.futures` module can speed this up if you have more than one cpu/core at hand. There's a backport of that module for Python 2.7. – BlackJack Sep 20 '17 at 16:14
  • @VasilyBronsky: I'm not saying it's wrong, I'm saying it's slow. Are you already using a 3rd-party Levenshtein package implemented in C or C++? If not, that is one way to extract more speed. Also check out [this](https://stackoverflow.com/questions/3183149/most-efficient-way-to-calculate-levenshtein-distance), [this](https://stackoverflow.com/questions/16278874/how-to-speed-up-levenshtein-distance-calculation), and [this](https://stackoverflow.com/questions/27274508/optimize-speed-of-levenshtein-distance-of-many-words). – John Y Sep 20 '17 at 21:31
  • I especially like [this answer](https://stackoverflow.com/a/3183199/95852). If you have tried everything Levenshtein-related and it's still not fast enough, you may have to resort to other comparison methods, such as [this](https://news.ycombinator.com/item?id=10461174). Maybe that would be good enough to use as-is; if not, maybe it's good enough to use as a kind of "first pass" filter to get the most likely candidates from the reference, and then use Levenshtein on that smaller set. – John Y Sep 20 '17 at 21:37
  • @JohnY Thank you! That's very useful information. I will go down that path! – Vasily Bronsky Sep 21 '17 at 07:14
  • Maybe you could show the code (including your Levenshtein distance function) on [Code Review](https://codereview.stackexchange.com/) – Serge Ballesta Sep 21 '17 at 09:04
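
For reference, here is a minimal sketch of the two-stage idea from John Y's comments above: a cheap first-pass filter to pick likely candidates, then a C-backed Levenshtein implementation for the final ranking. The function name, the n_candidates and cutoff values, and the use of the python-Levenshtein package together with the standard library's difflib.get_close_matches are assumptions for illustration, not part of the original code:

    import difflib
    import Levenshtein  # assumed dependency: the C-backed python-Levenshtein package

    def best_reference_match(query, references, n_candidates=20, cutoff=0.6):
        """Return (similarity, closest reference string) for one query string."""
        # First pass: cheap stdlib filter narrows ~40k references to a few candidates.
        # get_close_matches uses a different similarity measure, so with a high cutoff
        # it can occasionally miss the true Levenshtein-closest string.
        candidates = difflib.get_close_matches(query, references,
                                               n=n_candidates, cutoff=cutoff)
        if not candidates:
            candidates = references  # nothing passed the filter; fall back to the full list
        best_score, best_ref = 0.0, ''
        for ref in candidates:
            score = 1 - Levenshtein.distance(query, ref) / float(len(query))
            if score > best_score:
                best_score, best_ref = score, ref
        return best_score, best_ref

If that is still too slow, the per-row calls are independent of each other, so they can also be spread across CPU cores with concurrent.futures (ProcessPoolExecutor), as BlackJack suggests above.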

2 Answers


Not pertaining to your algorithm, but if you are running 100 files a day and the reference file doesn't change, it may be useful to create a master file (or a database) indexed with all of the already-calculated values. If the files are large and there is a lot of redundant data, this could significantly improve your performance over time.
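
As a rough sketch of that caching idea (the match_cache.csv file name and the two helper functions below are hypothetical), the already-calculated values could be kept in a small lookup file and consulted before doing any distance calculations:

    import csv
    import os

    CACHE_FILE = 'match_cache.csv'  # hypothetical name for the master file

    def load_cache():
        """Load previously calculated string -> (score, match) pairs, if any exist."""
        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE, 'r') as f:
                for row in csv.reader(f):
                    if row:  # skip blank lines
                        cache[row[0]] = (float(row[1]), row[2])
        return cache

    def save_cache(cache):
        """Write the cache back out so later runs can reuse it."""
        with open(CACHE_FILE, 'w') as f:
            writer = csv.writer(f)
            for key, (score, match) in cache.items():
                writer.writerow([key, score, match])

Inside the main loop you would then look each row1[0] up in the cache first, fall back to comparing against the full reference book only on a miss, and add the new result to the cache afterwards.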

Preston Martin

If there's a chance that you have duplicate entries in one of the files, you can de-duplicate them (e.g. sort -u on the command line) before processing.
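
In Python this could be a simple order-preserving de-duplication on the compared column before the main loop (a sketch; column 0 is assumed to hold the string being compared):

    def dedupe_rows(rows):
        """Keep only the first row for each distinct compared string (column 0)."""
        seen = set()
        unique = []
        for row in rows:
            if row[0] not in seen:
                seen.add(row[0])
                unique.append(row)
        return unique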

v2belleville