
I want to speed up my code, since the algorithm takes 17 hours to run. I have tried several approaches, but none of them worked. Could you please suggest an alternative to this code that would save time?

%%time

from fuzzywuzzy import fuzz  # missing in the original snippet

# test algorithm 1 - fuzzy matching
matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

I am comparing the names in two dataset columns to find the matching pairs.

  • The complexity of this is currently O(len(dataset1) × len(dataset2)). Could you use a search tree instead? – 10 Rep Apr 28 '20 at 17:18
  • 1
    I want to compare each name to each name in the other column, so how can I use a search tree here? – Guram Keretchashvili Apr 28 '20 at 17:25
  • 1
    Have you profiled your code? Is most of the time spent on the fuzz function or the rest? – Arya McCarthy Apr 28 '20 at 17:33
  • Probably use something like Dask dataframes instead of pandas. Has a similar API, but you can more easily process the data in parallel. – Trenton McKinney Apr 28 '20 at 17:34
  • 1
    If you really want to compare everything to everything, you end up with a huge number of operations, which takes time. And that fuzzy comparison is probably an expensive operation. One approach would be to save pairs you have compared earlier to a database and first try to look up the result there. – ex4 Apr 28 '20 at 17:34
  • Just to make sure.... do you have python-Levenshtein installed? – ex4 Apr 28 '20 at 17:36
  • Yes, I have it installed. – Guram Keretchashvili Apr 28 '20 at 17:50
  • @GuramKeretchashvili Sorry... I thought you were looking through a database instead. – 10 Rep Apr 28 '20 at 18:05
  • What exactly does `token_sort_ratio` do? It may be possible to order `dataset2['name']` in such a way that once you find one pair `x,y` that doesn't meet the threshold, then you can terminate the inner loop early. – chepner Apr 28 '20 at 19:24
    `fuzz.token_sort_ratio(x, y)` just returns a number (the similarity of the two strings). – Guram Keretchashvili Apr 28 '20 at 21:15

1 Answer


One possibility is to introduce parallel processing. This is currently single-threaded code, so it is probably not using all the CPU resources available. If you have 4 or 8 cores, you should see a significant improvement from spreading the calculation over all of them.

I don't have much experience with this in Python. Here's an intro to one approach: Parallel Processing in python

In your case, you might want to write a function that compares a single name from the first column against every name in the second column. Then map that function over the first column using the parallel-processing framework.
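A minimal sketch of that idea using the standard-library `multiprocessing.Pool`. To keep it self-contained I use a `difflib`-based stand-in for `fuzz.token_sort_ratio` and small hard-coded lists in place of the two dataframe columns; in your code you would swap in the real scorer and `dataset1['full_name_eng'].tolist()` / `dataset2['name'].tolist()`:

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def token_sort_ratio(a, b):
    # Stand-in for fuzz.token_sort_ratio: sort the tokens of each
    # string, then compare the results, scaled to 0-100.
    a2 = " ".join(sorted(a.lower().split()))
    b2 = " ".join(sorted(b.lower().split()))
    return int(round(SequenceMatcher(None, a2, b2).ratio() * 100))

# Illustrative stand-ins for the two dataframe columns.
names1 = ["John A. Smith", "Jane Doe"]
names2 = ["Smith, John A", "Bob Roberts"]

def match_one(x):
    # Compare one name from the first list against every name in the second.
    return [(x, y) for y in names2 if token_sort_ratio(x, y) > 85]

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker process per core
        results = pool.map(match_one, names1)
    # Flatten the per-name lists into one list of matched pairs.
    matched_pair = [pair for sub in results for pair in sub]
    print(matched_pair)  # → [('John A. Smith', 'Smith, John A')]
```

Note that each worker process gets its own copy of `names2`, so this parallelizes the outer loop without any shared state; the total amount of work is the same, it is just spread across cores.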

Dave Costa