I'm using the SBERT Python module to calculate the top-K most similar strings between an input corpus and a target corpus (in this case 100K vs 100K in size).
The module is pretty robust and gets the comparison done quickly, returning me a list of dictionaries containing the top-K most similar matches for each input string, each in the format:

{'corpus_id': <int>, 'score': <float>}
I can then wrap this up in a dataframe with the list of query strings used as an index, giving me a dataframe in the format:
Query_String | Corpus_ID | Similarity_Score
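
For context, here's roughly how the results get produced and wrapped up (a minimal sketch: the model choice and every name except sentence_list_2 are illustrative, and I'm showing top_k=1 to keep one dictionary per query):

from sentence_transformers import SentenceTransformer, util
import pandas as pd

sentence_list_1 = ["cancel my order", "delivery time"]      # input/query corpus
sentence_list_2 = ["order cancellation", "shipping times"]  # target corpus

model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
query_embeddings = model.encode(sentence_list_1, convert_to_tensor=True)
corpus_embeddings = model.encode(sentence_list_2, convert_to_tensor=True)

# hits[i] is a list of the top-k dicts for query i, e.g. [{'corpus_id': 0, 'score': 0.83}]
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=1)

# one row per query string, with the whole result dict kept in a single column
output_df = pd.DataFrame({'dictionary': [h[0] for h in hits]}, index=sentence_list_1)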
The main time-sink with my approach, however, is matching up the corpus ID with the actual string in the corpus, so I know which string each input was matched against. My current solution uses pandas apply together with the pandarallel module:
def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']
    corpus_id = dict_obj['corpus_id']  # corpus_id is an integer index into the corpus list
    score = dict_obj['score']
    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)
    return [matched_corpus_keyword, score]
.....
.....
# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
    lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)
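
For completeness, parallel_apply only exists after pandarallel has been initialised, which in my case is just the standard call:

from pandarallel import pandarallel
pandarallel.initialize()  # patches DataFrames with .parallel_apply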
This takes around 2 minutes for an input corpus of 100K against a target corpus of 100K in size. However, I'm dealing with corpora several million strings in size, so any further increase in speed here is welcome.