I am using a function to compare the sentences of two dataframes and extract the sentence with the highest similarity together with its score:

- `df1`: contains 40,000 sentences
- `df2`: contains 400 sentences

Each sentence of `df1` is compared against the 400 sentences of `df2`, and the function returns a tuple containing the sentence with the highest score and the score itself. If the returned score is below 96, it returns a tuple with two `None` values.
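For illustration, the two possible results for one sentence look like this (title and score invented):

('a very similar project title', 97)   # best score >= 96
(None, None)                           # best score below 96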
I have tried different algorithms for comparing the sentences (`spacy` and `nltk`), but I found the best results in terms of processing time by using the `fuzzywuzzy.process` module and its `process.extractOne(query, choices)` method, combined with `dask` by partitioning `df1` into 4 partitions. Inside `extractOne()` I am using the option `score_cutoff=96` to return only matches with a score >= 96. The idea was that once a score >= 96 was found, the function would not have to iterate through the whole `df2`, but it seems it does not work like that (see the sketch below).
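As far as I can tell, `score_cutoff` only filters the final result; it does not stop the scan early. A minimal sketch of the behaviour I observe (the sentences are invented placeholders):

from fuzzywuzzy import process

choices = ['project about rural housing', 'urban water supply', 'school construction']

# Every choice is still scored; score_cutoff only decides whether
# the best match is returned or None.
process.extractOne('project about rural housing', choices, score_cutoff=96)
# -> ('project about rural housing', 100)
process.extractOne('something unrelated', choices, score_cutoff=96)
# -> None, because no choice reaches the cutoff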
I also tried partitioning `df2` inside the function, but the processing time is no better than iterating over `df2` with a list comprehension (this attempt is left commented out in the code below).
Here's my code:
from fuzzywuzzy import process
from tqdm import tqdm
import dask.dataframe as dd

tqdm.pandas()  # enables df.progress_apply

# df1 and df2 are the pandas dataframes described above
ddf1 = dd.from_pandas(df1, npartitions=4)

def similitud(text1):
    # Best match of text1 among the 400 titles of df2; extractOne
    # returns (match, score), or None if no score reaches 96.
    a = process.extractOne(text1,
                           [df2['TITULO_PROYECTO'][i] for i in range(len(df2))],
                           score_cutoff=96)
    """
    # The attempt with df2 partitioned inside the function:
    a = process.extractOne(text1,
                           ddf2.map_partitions(lambda df2:
                                                   df2.apply(lambda row:
                                                                 row['TITULO_PROYECTO'],
                                                             axis=1),
                                               meta='str'
                                               ).compute(scheduler='processes'),
                           score_cutoff=96)
    """
    return (None, None) if a is None else a

tupla_values = ddf1.map_partitions(lambda df1:
                                       df1.progress_apply(lambda row:
                                                              similitud(row['TITULO_PROYECTO']),
                                                          axis=1),
                                   meta='str'
                                   ).compute(scheduler='processes')
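One thing worth noting about the code above: the choices list is rebuilt from `df2` on every one of the 40,000 calls to `similitud`. Hoisting it out of the function (a sketch; `choices` is a name introduced here) avoids that repeated work while keeping the same logic:

# Build the 400 choices once, reuse them for every call.
choices = df2['TITULO_PROYECTO'].tolist()

def similitud(text1):
    a = process.extractOne(text1, choices, score_cutoff=96)
    return (None, None) if a is None else a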
How can I reduce the processing time?