I'm using the SBERT Python module to calculate the top-K most similar strings between an input corpus and a target corpus (in this case 100K vs 100K in size).
The module is pretty robust and gets the comparison done quickly, returning me a list of dictionaries containing the top-K most similar matches for each input string, each in the format:

{'corpus_id': <int>, 'score': <float>}
I can then wrap this up in a dataframe with the list of query strings used as an index, giving me a dataframe in the format:
Query_String | Corpus_ID | Similarity_Score
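
For context, here's roughly how the results get produced and wrapped up (a minimal sketch: the model choice and every name except sentence_list_2 are illustrative, and I'm showing top_k=1 to keep one dictionary per query):

from sentence_transformers import SentenceTransformer, util
import pandas as pd

sentence_list_1 = ["cancel my order", "delivery time"]      # input/query corpus
sentence_list_2 = ["order cancellation", "shipping times"]  # target corpus

model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
query_embeddings = model.encode(sentence_list_1, convert_to_tensor=True)
corpus_embeddings = model.encode(sentence_list_2, convert_to_tensor=True)

# hits[i] is a list of the top-k dicts for query i, e.g. [{'corpus_id': 0, 'score': 0.83}]
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=1)

# one row per query string, with the whole result dict kept in a single column
output_df = pd.DataFrame({'dictionary': [h[0] for h in hits]}, index=sentence_list_1)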
The main time-sink with my approach, however, is matching up the corpus ID with the actual string in the corpus, so I know which string each input was matched against. My current solution uses pandas apply together with the pandarallel module:
def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']
    corpus_id = dict_obj['corpus_id']  # corpus_id is an integer index into the corpus list
    score = dict_obj['score']
    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)
    return [matched_corpus_keyword, score]
.....
.....
# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
    lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)
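
For completeness, parallel_apply only exists after pandarallel has been initialised, which in my case is just the standard call:

from pandarallel import pandarallel
pandarallel.initialize()  # patches DataFrames with .parallel_apply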
This takes around 2 minutes for an input corpus of 100K against a target corpus of 100K in size. However, I'm dealing with corpora several million strings in size, so any further increase in speed here is welcome.