
Introduction

I know there are many related questions on this platform, and I have been researching and trying out several of the approaches they suggest. However, I wanted to collate all of this here and get suggestions on the problem.

Problem

I have a large file (~20M records) with a few columns. Loaded into a pandas DataFrame, it looks like this:

id | words_in_sentences  | freq
 1 | [county,went,live]  | 400
 2 | [going,on,mht00]    | 500

I have another file that has roughly 300k words.

words=['on','mht','live']

Expected result:

I want to check whether each list in the 'words_in_sentences' column contains any of these 300k words. It would also be good to get one of the matching words.

id | words_in_sentences  | freq | is_word_present | word
 1 | [county,went,live]  | 400  | True            | live
 2 | [going,on,mht00]    | 500  | True            | on

Approaches

  1. Apply a lookup function row-wise:

        def lookup_none_simple(row):
            # Keep every word from the 300k-word list that occurs in this row's sentence.
            sent = 'words_in_sentences'
            row['matching_word'] = [z for z in words if z in row[sent]]
            return row

        none = none.apply(lookup_none_simple, axis=1)


  2. Change the list to a set and apply the same function row-wise (a fuller sketch follows below):

        row['matching_word'] = set(words).intersection(set(row[sent]))
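A self-contained version of this set-based lookup, for reference; building the 300k-word set once outside the function (rather than per row) and the helper name lookup_set are illustrative choices, not part of the original code:

        word_set = frozenset(words)  # build the 300k-word set once, not per row

        def lookup_set(row):
            # Intersect this row's words with the precomputed set.
            matches = word_set.intersection(row['words_in_sentences'])
            row['is_word_present'] = bool(matches)
            row['word'] = next(iter(matches), None)  # one matching word, if any
            return row

        none = none.apply(lookup_set, axis=1)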

  3. Use CountVectorizer to map words to numbers and go with the same logic (a sketch follows below).
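
A rough sketch of the CountVectorizer idea; fixing the vocabulary to the 300k words (which assumes the word list has no duplicates) and the way one matching word is recovered are illustrative choices rather than tested code:

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer

        # Fix the vocabulary to the 300k target words; identity tokenizer and
        # preprocessor let the already-tokenised lists pass straight through.
        vec = CountVectorizer(vocabulary=words,
                              tokenizer=lambda tokens: tokens,
                              preprocessor=lambda tokens: tokens)
        X = vec.transform(none['words_in_sentences'])  # sparse matrix: n_rows x n_words

        none['is_word_present'] = np.asarray(X.sum(axis=1)).ravel() > 0

        # argmax of an all-zero row is 0, so the recovered word is only
        # meaningful where is_word_present is True.
        cols = np.asarray(X.argmax(axis=1)).ravel()
        none['word'] = np.where(none['is_word_present'], np.asarray(words)[cols], None)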

Issues

  • All these approaches are time-consuming. I ran them on a large instance with 64 GB of RAM and still hit memory errors.

Other approaches

Parallelise the search (a chunked joblib sketch follows this list):

  • Use joblib
  • Use multiprocessing
  • Use Dask for the searches
  • Use Spark (didn't try)
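
A minimal sketch of that chunked-parallel idea with joblib, assuming the DataFrame is called none as above and the 300k words fit in a frozenset; the chunk size, n_jobs and helper name are illustrative rather than tuned values:

    import pandas as pd
    from joblib import Parallel, delayed

    word_set = frozenset(words)  # build the 300k-word set once

    def process_chunk(chunk):
        # First matching word (or None) for every row in this chunk.
        chunk = chunk.copy()
        chunk['word'] = chunk['words_in_sentences'].apply(
            lambda sent: next((w for w in sent if w in word_set), None))
        chunk['is_word_present'] = chunk['word'].notna()
        return chunk

    size = 200_000  # rows per chunk
    chunks = [none.iloc[i:i + size] for i in range(0, len(none), size)]
    parts = Parallel(n_jobs=-1)(delayed(process_chunk)(c) for c in chunks)
    none = pd.concat(parts)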

I am finding it extremely hard to reduce the time taken for this task. Can somebody offer some advice?

pnv
  • What is the length of the lists in the 'words_in_sentences' column? Is it constant? – Rajesh Bhat Sep 18 '20 at 12:35
  • Length of the lists in 'words_in_sentences' is between 3 to 20 words. On average, it will be around 6 words. – pnv Sep 18 '20 at 15:59
  • https://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exists-in-a-list - Very useful post – pnv Sep 19 '20 at 13:58

1 Answer


I don't really know how big a dataset this method can handle, but it works fine on a small dataset. Maybe you can split the dataframe into batches of, say, 200K rows and apply this method in a loop. You can use the explode() method in pandas and then do a join with the words.

import pandas as pd

df = pd.DataFrame({'words_in_sentences': [['county', 'went', 'live'],
                                          ['going', 'on', 'mht00', 'dum'],
                                          ['abc', 'def']],
                   'freq': [400, 500, 600]})
df['id'] = df.index

words = ['on', 'mht', 'live']
df_words = pd.DataFrame({'words': words})

# One row per (id, word) pair, keeping only the words that are in the word list.
df_exp = df[['id', 'words_in_sentences']].explode('words_in_sentences')
df_exp = df_exp.merge(df_words, left_on='words_in_sentences', right_on='words')

# Collapse back to one row per id, collecting all matched words in a list.
df_matched = df_exp.groupby('id', as_index=False).agg({'words': list})
df_matched['is_word_present'] = True

# Join the matches onto the original frame; ids with no match get False.
df = df.merge(df_matched, on='id', how='left')
df['is_word_present'] = df.is_word_present.fillna(False)
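
If it helps, one possible shape for the batching loop suggested above; the batch size and the match_batch helper name are illustrative, and it assumes df is the original (not yet merged) frame:

def match_batch(batch):
    # Apply the explode/merge logic above to one slice of the big frame.
    exp = batch[['id', 'words_in_sentences']].explode('words_in_sentences')
    exp = exp.merge(df_words, left_on='words_in_sentences', right_on='words')
    matched = exp.groupby('id', as_index=False).agg({'words': list})
    matched['is_word_present'] = True
    out = batch.merge(matched, on='id', how='left')
    out['is_word_present'] = out.is_word_present.fillna(False)
    return out

batch_size = 200_000  # roughly the 200K rows suggested above
batches = (df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size))
df = pd.concat(match_batch(b) for b in batches)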
Rajesh Bhat