
So I am using Levenshtein distance to find the closest match and replace many values in a large dataframe, using this answer as a base:

import operator

def levenshteinDistance(s1, s2):
    # classic dynamic-programming edit distance, keeping only the previous row
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

def closest_match(string, matchings):
    scores = {}
    for m in matchings:
        # 1 - distance, so that max() below picks the candidate with the smallest distance
        scores[m] = 1 - levenshteinDistance(string, m)

    return max(scores.items(), key=operator.itemgetter(1))[0]
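
For a quick sense of what closest_match does, here is a hypothetical check against a few of the sample products listed further down (illustration only, not part of the original code):

closest_match('pizzas', ['pizza', 'ketchup', 'salami'])   # -> 'pizza'
closest_match('kechap', ['pizza', 'ketchup', 'salami'])   # -> 'ketchup'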

Replacing many values in a moderately large dataframe (100k+ rows) from another of similar size as follows takes forever to run (it has been running for the last half hour, ha!):

results2.products = [closest_match(string, results.products)
                     if string not in results.products.values else string
                     for string in results2.products]

So is there a way to do this more efficiently? I added the if-else condition so that when there is a direct match no distance calculations are involved, which would have produced the same result anyway.

Sample Data

results:

   products
0, pizza
1, ketchup
2, salami
3, anchovy
4, pepperoni
5, marinara
6, olive
7, sausage
8, cheese
9, bbq sauce
10, stuffed crust

results2:

   products
0, salaaaami
1, kechap
2, lives
3, ppprn
4, pizzas
5, marinara
6, sauce de bbq
7, marinara sauce
8, chease
9, sausages
10, crust should be stuffed

I want the values in results2 to be replaced by the closest match in results.

Hamza

2 Answers


Use compiled Python:

Use Cython

Use PyPy (a JIT-compiled Python implementation)

Use Numba on both of your functions, as follows:

from numba import jit
@jit
def levenshteinDistance(s1, s2):
...
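
Numba's nopython mode works best on numeric data, so one possible workaround (a sketch only, not benchmarked on this data) is to byte-encode the strings outside the jitted function and run the dynamic program over uint8 arrays:

import numpy as np
from numba import njit

@njit
def lev_numeric(a, b):
    # a, b: 1-D uint8 arrays (byte-encoded strings); standard DP edit distance
    if a.size > b.size:
        a, b = b, a
    prev = np.arange(a.size + 1)
    for j in range(b.size):
        cur = np.empty_like(prev)
        cur[0] = j + 1
        for i in range(a.size):
            if a[i] == b[j]:
                cur[i + 1] = prev[i]
            else:
                cur[i + 1] = 1 + min(prev[i], prev[i + 1], cur[i])
        prev = cur
    return prev[-1]

def levenshtein_jit(s1, s2):
    # encode outside the jitted code so @njit only ever sees numeric arrays
    return lev_numeric(np.frombuffer(s1.encode(), dtype=np.uint8),
                       np.frombuffer(s2.encode(), dtype=np.uint8))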
Aaj Kaal
  • psyco has been dead since 2012. Even then it was only for 32-bit Python, and obviously in 2020 I'm running the 64-bit version: https://superuser.com/questions/195047/how-to-install-psyco-getting-no-python-installation-found-in-registry – Hamza Nov 20 '20 at 19:57
  • @jit supports parallel=True and nopython=True for faster execution, but both flags are unavailable for the datatypes I am using. Without those flags I left it to run, but the warnings showed it falling back to object mode for the string and other datatypes I am using, with no obvious speed advantage. Even then I let it run, but it didn't get anywhere after 2 hours; it was still running and I stopped it. – Hamza Nov 22 '20 at 20:07

So I took a couple of measures to improve speed and got an almost 4800x speedup.

Posting here to benefit anyone dealing with slow performance of any CPU-intensive task on pandas:

  1. Instead of replacing everything at once as in the question, I built a replacement dictionary from the unique values of each dataframe and replaced with that. Because there are many, many repeated values, this went from taking forever (I stopped it after 2 hours) to about 2 minutes. That's a 60x speedup:

    replacements = {string: closest_match(string, results.products.unique())
                    if string not in results.products.unique() else string
                    for string in results2.products.unique()}
    results2.replace({'products': replacements}, inplace=True)
    
  2. I used a C-based implementation of the Levenshtein distance: the editdistance library. While researching I found that many such tasks have C-based implementations readily available (matrix multiplication, search algorithms, etc.), and you can always write a module in C yourself and use it from Python. editdistance.eval('banana', 'bahama') took only 1.71 µs ± 289 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), compared with my own levenshteinDistance('banana', 'bahama'), which took 34.4 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each). That's a 20x speedup. The resulting closest_match() function (a compact variant of steps 1 and 2 is sketched after this list):

    import operator
    import editdistance
    def closest_match(string, matchings):
        scores = {}
        for m in matchings:
            scores[m] = 1 - editdistance.eval(string,m)
    
        return max(scores.items(), key=operator.itemgetter(1))[0]
    
  3. Then I made use of all of my cores at once through parallelism. I went through various alternatives, e.g. multiprocessing and threading, but none of them was as fast as modin.pandas. It requires minimal changes (just import modin.pandas as pd in place of import pandas as pd) and works elegantly. It made the previous runs around 4x faster.
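
For reference, here is a compact variant of steps 1 and 2 together (a sketch under the same assumptions, not the exact code I ran): editdistance.eval already returns a distance, so closest_match can simply take the candidate with the smallest distance, and the replacement dictionary can then be applied with Series.map:

import editdistance

def closest_match(string, matchings):
    # candidate with the smallest edit distance wins
    return min(matchings, key=lambda m: editdistance.eval(string, m))

known = set(results.products.unique())
replacements = {s: s if s in known else closest_match(s, known)
                for s in results2.products.unique()}
results2.products = results2.products.map(replacements)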

Thus totaling roughly a 4800x speedup (60 × 20 × 4), which is massive, and the whole thing now runs in a blink.

UPDATE

I have a massive number of products, so I was looking for further optimizations, and it turned out vectorization was the answer. My updated closest-match function now looks like:

results.index = results.products  # so that idxmin() below returns the product name itself
def closest_match_results_product(string):
    # distance from `string` to every clean product; return the name of the closest one
    return results.products.apply(lambda x: editdistance.eval(string, x)).idxmin()

And applying this function looks like:

results2.products = results2.products.apply(closest_match_results_product)

Bam! More optimized than ever! Still, if there is further room for improvement, I would love to see any suggestions!
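
One further idea worth trying (a sketch only, not benchmarked): instead of applying the matcher to every row directly, score each distinct misspelling once using the unique-values trick from step 1, then map the whole column through the resulting dictionary:

# score each distinct value once, then map the column through the dictionary
mapping = {s: closest_match_results_product(s) for s in results2.products.unique()}
results2.products = results2.products.map(mapping)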

Hamza