
I want to speed up my code, since the algorithm takes 17 hours to run. I have tried several approaches, but none of them worked. Could you please suggest an alternative to this code that would save time?

%%time

from fuzzywuzzy import fuzz  # missing in the original snippet

# test algorithm 1 - fuzzy matching
matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

I am comparing the names in two dataset columns to find the matching pairs.

  • The complexity of this is currently O(len(dataset1) × len(dataset2)). Could you use a search tree instead? – 10 Rep Apr 28 '20 at 17:18
  • 1
    I want to compare each name to each name in the other column, so how can I use a search tree here? – Guram Keretchashvili Apr 28 '20 at 17:25
  • 1
    Have you profiled your code? Is most of the time spent on the fuzz function or the rest? – Arya McCarthy Apr 28 '20 at 17:33
  • Probably use something like Dask dataframes instead of pandas. Has a similar API, but you can more easily process the data in parallel. – Trenton McKinney Apr 28 '20 at 17:34
  • 1
    If you really want to compare everything to everything, you end up with a huge number of operations, which takes time. And that fuzzy comparison is probably an expensive operation. One approach would be to save pairs you have compared earlier to a database and first try to look up the result there. – ex4 Apr 28 '20 at 17:34
  • Just to make sure.... do you have python-Levenshtein installed? – ex4 Apr 28 '20 at 17:36
  • Yes, I have it installed. – Guram Keretchashvili Apr 28 '20 at 17:50
  • @GuramKeretchashvili Sorry... I thought you were looking through a database instead. – 10 Rep Apr 28 '20 at 18:05
  • What exactly does `token_sort_ratio` do? It may be possible to order `dataset2['name']` in such a way that once you find one pair `x,y` that doesn't meet the threshold, then you can terminate the inner loop early. – chepner Apr 28 '20 at 19:24
    `fuzz.token_sort_ratio(x, y)` just returns a number (the similarity of the two strings). – Guram Keretchashvili Apr 28 '20 at 21:15

1 Answer


One possibility is to introduce parallel processing. This is currently single-threaded code, so it is probably not using all the CPU resources available. If you have 4 or 8 cores, you should see a significant improvement from spreading the calculation over all of them.

I don't have much experience with this in Python. Here's an intro to one approach: Parallel Processing in python

In your case, you might want to write a function that compares a single name from the first column against every name in the second column. Then map that function over the first column using the parallel-processing framework.
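A minimal sketch of that idea using the standard-library `multiprocessing.Pool`. To keep it self-contained I use a `difflib`-based stand-in for `fuzz.token_sort_ratio` and small hard-coded lists in place of the two dataframe columns; in your code you would swap in the real scorer and `dataset1['full_name_eng'].tolist()` / `dataset2['name'].tolist()`:

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def token_sort_ratio(a, b):
    # Stand-in for fuzz.token_sort_ratio: sort the tokens of each
    # string, then compare the results, scaled to 0-100.
    a2 = " ".join(sorted(a.lower().split()))
    b2 = " ".join(sorted(b.lower().split()))
    return int(round(SequenceMatcher(None, a2, b2).ratio() * 100))

# Illustrative stand-ins for the two dataframe columns.
names1 = ["John A. Smith", "Jane Doe"]
names2 = ["Smith, John A", "Bob Roberts"]

def match_one(x):
    # Compare one name from the first list against every name in the second.
    return [(x, y) for y in names2 if token_sort_ratio(x, y) > 85]

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker process per core
        results = pool.map(match_one, names1)
    # Flatten the per-name lists into one list of matched pairs.
    matched_pair = [pair for sub in results for pair in sub]
    print(matched_pair)  # → [('John A. Smith', 'Smith, John A')]
```

Note that each worker process gets its own copy of `names2`, so this parallelizes the outer loop without any shared state; the total amount of work is the same, it is just spread across cores.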

Dave Costa