Introduction
I know there are many questions on this platform related to this one, and I have been researching and trying out different approaches from the options available. However, I wanted to collate all of it here and get suggestions on the problem.
Problem
I have a large file (~20M records) with a few fields. Loaded into a pandas DataFrame, it looks like this:
id | words_in_sentences | freq
1 | [county,went,live] | 400
2 | [going,on,mht00] | 500
I have another file containing roughly 300k words, loaded into a list:

words = ['on', 'mht', 'live']
Expected result:
I want to check whether each list in the column 'words_in_sentences' contains any of these 300k words. It would also be good to get one of the matching words.
id | words_in_sentences | freq | is_word_present | word
1 | [county,went,live] | 400 | True | live
2 | [going,on,mht00] | 500 | True | on
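For reference, a minimal reproducible setup of the data above (the names df and words are mine, standing in for the frames the question implies):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'words_in_sentences': [['county', 'went', 'live'], ['going', 'on', 'mht00']],
    'freq': [400, 500],
})
words = ['on', 'mht', 'live']  # in reality ~300k entries loaded from the second file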
Approaches
- Approach 1: apply a lookup function row-wise:
def lookup_none_simple(row):
    sent = 'words_in_sentences'
    # collect every one of the 300k words that occurs in this row's token list
    row['matching_word'] = [z for z in words if z in row[sent]]
    return row

df = df.apply(lookup_none_simple, axis=1)
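As an aside (my own variant, not part of the original attempt): since only one matching word is needed, the scan can stop at the first hit, and mapping over the single column avoids rebuilding each full row:

def first_hit(tokens):
    # scan the word list and return the first word present in this token list
    for z in words:
        if z in tokens:
            return z
    return None

df['word'] = df['words_in_sentences'].map(first_hit)
df['is_word_present'] = df['word'].notna()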
- Approach 2: convert the word list to a set and apply the same row-wise function, with the lookup line changed to:

row['matching_word'] = set(words).intersection(set(row[sent]))
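A likely bigger win here (an assumption on my part): build the set once instead of once per row, since constructing a 300k-element set 20M times dominates the runtime:

word_set = set(words)  # built once, reused for every row

def lookup_set(row):
    # set.intersection accepts any iterable, so no per-row set() is needed
    hits = word_set.intersection(row['words_in_sentences'])
    row['is_word_present'] = bool(hits)
    row['word'] = next(iter(hits), None)  # any one matching word, or None
    return row

df = df.apply(lookup_set, axis=1)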
- Use CountVectorizer to map words to numbers and apply the same logic (see the sketch below).
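This is only a sketch of that idea, assuming scikit-learn is available; the pass-through analyzer is my assumption, exploiting the fact that the token lists are already split:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# restrict the vocabulary to the 300k words and skip tokenization entirely
cv = CountVectorizer(analyzer=lambda tokens: tokens, vocabulary=set(words))
X = cv.transform(df['words_in_sentences'])       # sparse matrix: rows x vocabulary

df['is_word_present'] = np.asarray(X.sum(axis=1)).ravel() > 0
matches = cv.inverse_transform(X)                # vocabulary terms found in each row
df['word'] = [m[0] if len(m) else None for m in matches]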
Issues
- All these approaches are time-consuming. I ran them on a large instance with 64 GB of RAM and still hit memory errors.
Other approaches
- Use joblib
- Use multiprocessing (a sketch follows this list)
- Use Dask for the searches
- Use Spark (didn't try)
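In case it helps, a minimal multiprocessing sketch of the idea (my own; the pool size and chunk count are placeholders, and it assumes a fork-based start method so workers inherit word_set without extra copies):

from multiprocessing import Pool
import numpy as np
import pandas as pd

word_set = frozenset(words)   # built once in the parent process

def first_match(tokens):
    # per-token set membership is O(1), so this is O(len(tokens)) per row
    for t in tokens:
        if t in word_set:
            return t
    return None

def process_chunk(chunk):
    chunk = chunk.copy()
    chunk['word'] = chunk['words_in_sentences'].map(first_match)
    chunk['is_word_present'] = chunk['word'].notna()
    return chunk

if __name__ == '__main__':
    with Pool(8) as pool:                  # 8 workers, tune to the instance
        parts = np.array_split(df, 64)     # smaller chunks keep memory flatter
        df = pd.concat(pool.map(process_chunk, parts))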
I am finding it extremely hard to reduce the time this task takes. Can somebody offer advice?