I am filtering rows of a dataframe based on specific words/phrases. The keywords_list contains 4040 terms, and the dataset has around 1.5 million records, so this takes forever to complete. How can I speed up this operation?
import re

def is_phrase_in(phrase, text):
    # Whole-word, case-insensitive match; re.escape guards against
    # phrases that contain regex metacharacters
    return re.search(r"\b{}\b".format(re.escape(phrase)), text, re.IGNORECASE) is not None

def filter_on_keywords(text, keywords_list):
    # Count how many of the keywords/phrases occur in the text
    keyword_count = 0
    for keyword in keywords_list:
        if is_phrase_in(keyword, text):
            keyword_count += 1
    return keyword_count
Timing on the first 10 rows:

%%timeit
df[:10].apply(lambda x: filter_on_keywords(x['text'], keywords_list), axis=1)

This gives 1 loop, best of 3: 5.63 s per loop. For 50 rows it takes 1 loop, best of 3: 28.4 s per loop. At roughly 0.56 s per row, the full 1.5 million rows would take more than a week.
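
Would it help to combine all the keywords into one compiled pattern, so each row is scanned once instead of 4040 times? A rough sketch of what I mean (build_keyword_pattern and count_keywords are placeholder names, and the escaping assumes the keywords are literal phrases, not regex patterns):

import re

def build_keyword_pattern(keywords_list):
    # One alternation over all keywords; longest phrases first so the
    # regex engine prefers the longest match at each position
    escaped = sorted((re.escape(k) for k in keywords_list), key=len, reverse=True)
    return re.compile(r"\b(?:{})\b".format("|".join(escaped)), re.IGNORECASE)

def count_keywords(text, pattern):
    # Count distinct keywords found in one pass over the text.
    # NB: overlapping phrases (e.g. "new york" vs "york") are counted
    # differently here than in the per-keyword loop above.
    return len({m.group(0).lower() for m in pattern.finditer(text)})

pattern = build_keyword_pattern(keywords_list)
df['keyword_count'] = df['text'].apply(lambda t: count_keywords(t, pattern))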
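
Another possibility, assuming a third-party dependency is acceptable: the flashtext library (pip install flashtext) builds a trie over the keywords and scans each text once, which is aimed at exactly this many-keywords case. A sketch:

from flashtext import KeywordProcessor

kp = KeywordProcessor()  # case-insensitive by default
kp.add_keywords_from_list(keywords_list)

# Distinct keywords per row, mirroring what filter_on_keywords counts;
# note flashtext returns non-overlapping, longest matches only
df['keyword_count'] = df['text'].apply(lambda t: len(set(kp.extract_keywords(t))))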