Introduction
I know there are many questions on this platform related to this one, and I have been researching and trying out different approaches from the options available. However, I wanted to collate all of it here and get suggestions on the problem.
Problem
I have a large file (~20M records) with a few fields. Loaded into a pandas DataFrame, it looks like this:
id | words_in_sentences | freq
1 | [county,went,live] | 400
2 | [going,on,mht00] | 500
I have another file containing roughly 300k words, loaded into a list:

words = ['on', 'mht', 'live']
Expected result:
I want to check whether each list in the column 'words_in_sentences' contains any of these 300k words. It would also be good to get one of the matching words.
id | words_in_sentences | freq | is_word_present | word
1 | [county,went,live] | 400 | True | live
2 | [going,on,mht00] | 500 | True | on
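For reference, a minimal reproducible setup of the data above (the names df and words are mine, standing in for the frames the question implies):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'words_in_sentences': [['county', 'went', 'live'], ['going', 'on', 'mht00']],
    'freq': [400, 500],
})
words = ['on', 'mht', 'live']  # in reality ~300k entries loaded from the second file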
Approaches
- Approach 1: apply a lookup function row-wise:
def lookup_none_simple(row):
    sent = 'words_in_sentences'
    # collect every one of the 300k words that occurs in this row's token list
    row['matching_word'] = [z for z in words if z in row[sent]]
    return row

df = df.apply(lookup_none_simple, axis=1)
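As an aside (my own variant, not part of the original attempt): since only one matching word is needed, the scan can stop at the first hit, and mapping over the single column avoids rebuilding each full row:

def first_hit(tokens):
    # scan the word list and return the first word present in this token list
    for z in words:
        if z in tokens:
            return z
    return None

df['word'] = df['words_in_sentences'].map(first_hit)
df['is_word_present'] = df['word'].notna()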
- Approach 2: convert the word list to a set and apply the same row-wise function, with the lookup line changed to:

row['matching_word'] = set(words).intersection(set(row[sent]))
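A likely bigger win here (an assumption on my part): build the set once instead of once per row, since constructing a 300k-element set 20M times dominates the runtime:

word_set = set(words)  # built once, reused for every row

def lookup_set(row):
    # set.intersection accepts any iterable, so no per-row set() is needed
    hits = word_set.intersection(row['words_in_sentences'])
    row['is_word_present'] = bool(hits)
    row['word'] = next(iter(hits), None)  # any one matching word, or None
    return row

df = df.apply(lookup_set, axis=1)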
- Use CountVectorizer to map words to numbers and apply the same logic (see the sketch below).
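This is only a sketch of that idea, assuming scikit-learn is available; the pass-through analyzer is my assumption, exploiting the fact that the token lists are already split:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# restrict the vocabulary to the 300k words and skip tokenization entirely
cv = CountVectorizer(analyzer=lambda tokens: tokens, vocabulary=set(words))
X = cv.transform(df['words_in_sentences'])       # sparse matrix: rows x vocabulary

df['is_word_present'] = np.asarray(X.sum(axis=1)).ravel() > 0
matches = cv.inverse_transform(X)                # vocabulary terms found in each row
df['word'] = [m[0] if len(m) else None for m in matches]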
Issues
- All these approaches are time-consuming. I ran them on a large instance with 64 GB of RAM and still hit memory errors.
Other approaches
- Use joblib
- Use multiprocessing (a sketch follows this list)
- Use Dask for the searches
- Use Spark (didn't try)
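In case it helps, a minimal multiprocessing sketch of the idea (my own; the pool size and chunk count are placeholders, and it assumes a fork-based start method so workers inherit word_set without extra copies):

from multiprocessing import Pool
import numpy as np
import pandas as pd

word_set = frozenset(words)   # built once in the parent process

def first_match(tokens):
    # per-token set membership is O(1), so this is O(len(tokens)) per row
    for t in tokens:
        if t in word_set:
            return t
    return None

def process_chunk(chunk):
    chunk = chunk.copy()
    chunk['word'] = chunk['words_in_sentences'].map(first_match)
    chunk['is_word_present'] = chunk['word'].notna()
    return chunk

if __name__ == '__main__':
    with Pool(8) as pool:                  # 8 workers, tune to the instance
        parts = np.array_split(df, 64)     # smaller chunks keep memory flatter
        df = pd.concat(pool.map(process_chunk, parts))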
I am finding it extremely hard to reduce the time this task takes. Can somebody offer advice?