
I am filtering rows of a DataFrame based on specific words/phrases. The keywords_list contains 4,040 terms, and the dataset has around 1.5 million records. It takes forever to complete. How can I speed up this operation?

import re

def is_phrase_in(phrase, text):
    # Escape the phrase so regex metacharacters in a keyword can't break the search.
    return re.search(r"\b{}\b".format(re.escape(phrase)), text, re.IGNORECASE) is not None

def filter_on_keywords(text, keywords_list):
    # Count how many of the keywords occur in the text as whole words.
    keyword_count = 0
    for keyword in keywords_list:
        if is_phrase_in(keyword, text):
            keyword_count += 1
    return keyword_count

Timing 10 rows of data gives 1 loop, best of 3: 5.63 s per loop:

%%timeit
df[:10].apply(lambda x: filter_on_keywords(x['text'], keywords_list), axis=1)

For 50 rows it takes 1 loop, best of 3: 28.4 s per loop.
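A minimal sketch of one way to cut this down (not from the original post): since every keyword is searched with the same flags, all 4,040 terms can be compiled once into a single alternation, so each row is scanned once instead of 4,040 times. Sorting longest-first keeps a phrase like word1 word2 from being shadowed by word1:

import re

# Sketch only: build one combined pattern from the whole keyword list.
# Longest-first ordering lets multi-word phrases win over their prefixes.
sorted_keywords = sorted(keywords_list, key=len, reverse=True)
pattern = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, sorted_keywords))),
    re.IGNORECASE,
)

def count_keywords(text):
    # findall counts every occurrence; the original counts each keyword at
    # most once per row. For that semantics use
    # len({m.lower() for m in pattern.findall(text)}) instead.
    return len(pattern.findall(text))

df["keyword_count"] = df["text"].map(count_keywords)

Mapping over the text column directly also avoids the per-row overhead of apply with axis=1.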

joel
  • For performance, consider a specialist library, e.g. the [Aho-Corasick algorithm](https://stackoverflow.com/a/48600345/9209546); see the sketch after these comments. – jpp Jul 08 '20 at 09:46
  • Why are you using simple filtering? You could use pandas DataFrame filtering, which supports regex as well; that will be faster than this. – vks Jul 08 '20 at 09:47
  • This is also complicated by the fact you might actually need to count both `word1` and `word1 word2` keywords in `word1 word2 here`. – Wiktor Stribiżew Jul 08 '20 at 10:07
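A minimal sketch of the Aho-Corasick suggestion above, assuming the third-party pyahocorasick package (imported as ahocorasick). The automaton is built once and scans each text in a single pass regardless of the number of keywords; word boundaries have to be checked by hand, since Aho-Corasick matches raw substrings:

import ahocorasick

# Build the automaton once, lowercased for case-insensitive matching.
automaton = ahocorasick.Automaton()
for keyword in keywords_list:
    kw = keyword.lower()
    automaton.add_word(kw, kw)
automaton.make_automaton()

def count_keywords_ac(text):
    text = text.lower()
    count = 0
    for end, kw in automaton.iter(text):
        start = end - len(kw) + 1
        # Approximate \b: the neighbours of the hit must not be alphanumeric.
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) - 1 or not text[end + 1].isalnum()
        if before_ok and after_ok:
            count += 1
    return count

df["keyword_count"] = df["text"].map(count_keywords_ac)

Unlike a single regex, the automaton reports overlapping hits, so both `word1` and `word1 word2` are counted in `word1 word2 here`, which addresses the counting issue raised in the last comment.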

0 Answers