
I am using a function to compare the sentences of two dataframes and extract the best-matching sentence together with its similarity score:

  • df1 : containing 40,000 sentences
  • df2 : containing 400 sentences

Each sentence of df1 is compared against the 400 sentences of df2, and the function returns a tuple containing the best-matching sentence and its score. If the best score is below 96, it returns a tuple of two None values.

I have tried different approaches for comparing the sentences (spaCy and NLTK), but I got the best processing times using fuzzywuzzy's process.extractOne(text1, choices) method together with dask, partitioning df1 into 4 partitions. Inside extractOne() I pass score_cutoff = 96 so that only scores >= 96 are returned; the idea was that once a score >= 96 is found, the function would not have to iterate through the whole of df2, but it seems it does not work like that.

I also tried partitioning df2 inside the function, but the processing time was no better than iterating over df2 with a list comprehension (that attempt is commented out in the code).

Here's my code:

from fuzzywuzzy import process
from tqdm import tqdm

tqdm.pandas()   # needed for progress_apply
# ddf1 / ddf2 are the dask versions of df1 / df2 (df1 partitioned into 4 parts)

def similitud( text1 ):
    # compare text1 against every title in df2, keep the best match with score >= 96
    a = process.extractOne( text1,
                            [ df2['TITULO_PROYECTO'][i] for i in range( len( df2 ) ) ],
                            score_cutoff = 96
                          )
    """
    # attempt at partitioning df2 inside the function instead (not faster):
    a = process.extractOne( text1,
                            ddf2.map_partitions( lambda df2:
                                                     df2.apply( lambda row: row['TITULO_PROYECTO'],
                                                                axis = 1 ),
                                                 meta = 'str'
                                               ).compute( scheduler = 'processes' ),
                            score_cutoff = 96
                          )
    """
    return ( None, None ) if a is None else a

tupla_values = ddf1.map_partitions( lambda df1:
                                        df1.progress_apply( lambda row:
                                                                similitud( row['TITULO_PROYECTO'] ),
                                                            axis = 1 ),
                                    meta = 'str'
                                  ).compute( scheduler = 'processes' )

How can I reduce the processing time?

  • What are your ***current, as-is processing times*** and what is your *expected / nice-to-have* **target processing time** for any such improved processing strategy? *("Having no target, any way may lead you there" – a droplet of wisdom from Alice in Wonderland)* – user3666197 Nov 28 '20 at 10:58
  • Yes, you are right, I forgot to include the time. The current processing time is 2 hours and 16 minutes on a quad-core computer; my target is half that time. Thanks for your answer! – Ivan Nov 28 '20 at 17:46
  • In this case, just switching to the C-compiled extension *"python-Levenshtein (optional, **provides a 4-10x speedup** in String Matching)"* ought to be the way, giving the highest performance boost for the lowest refactoring cost, ought it not? – user3666197 Nov 28 '20 at 18:13
  • I didn't know about the C-compiled extension for Python. I did a little research and you are right, this should improve the performance. I'll try to implement it and use it in other cases as well. Your comment was really helpful, thanks user3666197. – Ivan Nov 28 '20 at 18:34
  • Glad to hear you appreciate my inputs above – feel free to click on any of my other answers to say thanks in the StackOverflow way :o) ... https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory/33065382#33065382 perhaps? – user3666197 Nov 28 '20 at 18:55

2 Answers


FuzzyWuzzy does not really use the score_cutoff parameter to save any work. It simply iterates over all elements, searches for the one with the highest score, and returns it in case that score is above your score_cutoff threshold.
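A rough sketch of that behaviour (not FuzzyWuzzy's actual code, just an illustration of why score_cutoff does not reduce the amount of work): every choice gets scored before the cutoff is applied.

from fuzzywuzzy import fuzz

def extract_one_sketch( query, choices, score_cutoff = 0 ):
    # scores *every* choice first; score_cutoff only filters the final result
    scored = [ ( choice, fuzz.WRatio( query, choice ) ) for choice in choices ]
    best   = max( scored, key = lambda pair: pair[1] )
    return best if best[1] >= score_cutoff else None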

You should try doing the same using RapidFuzz (I am the author), which provides a very similar interface. One difference is that the process.extractOne() method returns a tuple of (choice, score) in FuzzyWuzzy, whereas it returns a tuple of (choice, score, index) in RapidFuzz.
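For example (illustrative values; the exact score types and values may vary with the scorer and library version):

from fuzzywuzzy import process as fw_process
from rapidfuzz  import process as rf_process

choices = [ "Atlanta Falcons", "New York Jets" ]
fw_process.extractOne( "New York Jets", choices )   # -> ('New York Jets', 100)          choice, score
rf_process.extractOne( "New York Jets", choices )   # -> ('New York Jets', 100.0, 1)     choice, score, index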

RapidFuzz uses fast approximations of the Levenshtein distance that are guaranteed to return a score >= the actual score of the normalized Levenshtein distance. These approximations are used to skip the expensive Levenshtein calculation whenever the score is guaranteed to be below score_cutoff. However, process.extractOne() cannot stop at the first element with a score above score_cutoff either, since it searches for the best match above the score_cutoff threshold (it will only exit early in case it finds a match with a score of 100).
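A toy sketch of that pruning idea (this is not RapidFuzz's implementation; the bound and the expensive_ratio() call below are purely illustrative): the length difference between two strings already limits how high the normalized score can get, so hopeless candidates can be skipped cheaply.

def best_match_sketch( query, choices, score_cutoff ):
    best, best_score = None, score_cutoff
    for idx, choice in enumerate( choices ):
        # cheap upper bound: the edit distance is at least the length difference,
        # so the normalized score can never exceed this value
        bound = 100 * 2 * min( len( query ), len( choice ) ) / ( ( len( query ) + len( choice ) ) or 1 )
        if bound < best_score:
            continue                                   # skip the expensive comparison entirely
        score = expensive_ratio( query, choice )       # hypothetical full similarity computation
        if score >= best_score:
            best, best_score = ( choice, score, idx ), score
            if score == 100:
                break                                  # early exit only on a perfect match
    return best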

Also, you do not need to build a list of all choices inside similitud(), since both FuzzyWuzzy and RapidFuzz accept pandas Series as well. So you could just use something like:

from rapidfuzz import process

def similitud( text1 ):
    a = process.extractOne( text1,
                            df2['TITULO_PROYECTO'],
                            score_cutoff = 96
                            )
    return ( None, None ) if a is None else ( a[0], a[1] )

By default, process.extractOne() preprocesses your strings by lowercasing them, replacing non-alphanumeric characters such as punctuation with whitespace, and trimming whitespace from the start and end of the strings.

You could preprocess the data in df2 beforehand, so this does not have to be repeated for every element of df1 that is compared. Then only the element of df1 needs to be preprocessed on each call:

from rapidfuzz import process, utils

def similitud( text1 ):
    a = process.extractOne( utils.default_process( text1 ),
                            df2['PROCESSED_TITULO_PROYECTO'],
                            processor    = None,
                            score_cutoff = 96
                            )
    return ( None, None ) if a is None else ( a[0], a[1] )
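The preprocessed column used above has to be built once up front; PROCESSED_TITULO_PROYECTO is simply the illustrative name from the snippet, for example:

df2['PROCESSED_TITULO_PROYECTO'] = df2['TITULO_PROYECTO'].apply( utils.default_process )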

I recently created a benchmark that compares process.extractOne() for both the FuzzyWuzzy module (with python-Levenshtein) and the RapidFuzz module (can be found here).

[Benchmark graph: comparing process.extractOne in FuzzyWuzzy and RapidFuzz]


Benchmarks:

The source of the reproducible benchmark data:

  • titledata.csv from FuzzyWuzzy ( dataset with 2764 titles, max choices length: 2500 )

The hardware the above benchmarks were run on:

  • CPU: a single core of an i7-8550U
  • RAM: 8 GB
  • OS: Fedora 32
maxbachmann
  • Is the RapidFuzz code based on the C-extension python-Levenshtein, or does it go in some different direction or strategy? If not, how much faster or slower is the resulting RapidFuzz runtime performance compared to the classical FuzzyWuzzy + python-Levenshtein extension for a pair of blocks, say 1E5 sentences, 1000 unicode chars each, on a say 4-core / 32 GB machine? Thanks for posting the performance comparison, Max (kindly quote your device specs / memory-I/O channel count & O/S version, so as to be somewhat reasonably quantitative in in-vitro / in-vivo expectations). Thank you, Max. – user3666197 Nov 28 '20 at 18:10
  • The processing time has been reduced to 10.1 seconds using rapidfuzz and process.extractOne() as suggested. Thank you for the full explanation, Max!! – Ivan Nov 28 '20 at 18:27
  • @user3666197 This was already performed with the version using python-Levenshtein. When I find the time I will create a more thorough description of the benchmarks for RapidFuzz and what's causing these improvements in my documentation. https://colab.research.google.com/drive/19jODPBEvmtUtGIdAkir-__adJ4RyZUq8?usp=sharing#scrollTo=KIhPUvFHvTDH shows a similar benchmark that includes the pure-Python version of FuzzyWuzzy as well. – maxbachmann Nov 28 '20 at 21:56
  • By the way, in case you know how to benchmark these kinds of algorithms, I am always open to ideas, since my benchmarks are definitely not perfect. Feel free to open an issue on this over at GitHub ;) – maxbachmann Nov 28 '20 at 22:07

You can eliminate the for loop inside your function by modifying it to take two arguments and handling the arguments in the apply call, maybe:

tupla_values = ddf1.map_partitions( lambda df1:
                                        df1.progress_apply( lambda row:
                                                                similitud( row['TITULO_PROYECTO'], row['TITULO_PROYECTO'] ),
                                                            axis = 1 ),
                                    meta = 'str'
                                  ).compute( scheduler = 'processes' )

This is a general suggestion; your code will probably need further adjustments.
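The two-argument similitud() is not shown in the answer; a minimal sketch of one possible reading (purely an assumption, treating the second parameter as the collection of choices rather than the same column that the call above passes twice):

def similitud( text1, choices ):
    a = process.extractOne( text1, choices, score_cutoff = 96 )
    return ( None, None ) if a is None else a

# e.g. called from the apply as:
#     similitud( row['TITULO_PROYECTO'], df2['TITULO_PROYECTO'] )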

Milad Yousefi