I have a df with a text column. Say:
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
I also have a list with ~400 keywords, for example:
observations_words_list = ["bent","block","broken"]
I want to see what fraction of records have at least one keyword in their text, and I'm doing it like this:
# Drop rows with missing text and reset to a clean 0-based index
df = df.dropna(subset=['text']).reset_index(drop=True)
df_len = len(df['text'])

# Use a set so each membership test is O(1)
obs = set(observations_words_list)

# Count how many distinct keywords appear among each row's lowercased words
df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))

# Fraction of records containing at least one keyword
obs_count = len(df[df['Observations'] > 0]) / df_len
For the sample df above I expect the new column to hold 1 for the first record (it contains "broken") and 0 for the second, giving obs_count = 0.5. In reality I read a CSV with ~0.5M records.
Runtime is far from ideal on the full data (the per-row apply with a Python-level lambda looks like the bottleneck), and I'm looking for a faster way to process this step.
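One idea I've been considering, as a rough sketch only (not benchmarked, and note that a word-boundary regex treats punctuation slightly differently than split()): compile all keywords into a single alternation and let pandas' vectorized string methods do the matching:

import re

# Sketch: one word-boundary regex over all ~400 keywords.
# re.escape guards against keywords containing regex metacharacters.
pattern = r'\b(?:' + '|'.join(map(re.escape, observations_words_list)) + r')\b'

# Vectorized: True where the text contains at least one keyword
has_keyword = df['text'].str.contains(pattern, case=False, regex=True, na=False)
obs_count = has_keyword.mean()  # fraction of records with >= 1 keyword

Is something like this likely to beat the set-intersection apply at ~0.5M rows, or is there a better approach?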
Would love your ideas. Thanks!