
I have a df with a text column. Say:

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)

I also have a list with ~400 keywords, for example:

observations_words_list = ["bent","block","broken"]

I wish to see how many records have one or more keywords in their text, and I'm doing it like this:

    df = df['text'].dropna().reset_index()
    df_len = len(df['text'])
    obs = set(observations_words_list)
    df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
    obs_count = len(df[df['Observations'] > 0])/df_len

For the sample df (in reality I read a CSV with ~0.5M records), I expect the new column to hold 1 for the first record, 0 for the second, and overall obs_count = 0.5.

Runtime is far from ideal, and I'm looking for a faster way to process this step.

Would love your ideas. Thanks!

goidelg
  • Can you check [this](https://stackoverflow.com/questions/48541444/pandas-filtering-for-multiple-substrings-in-series)? – jezrael Aug 24 '21 at 06:10
  • Solution is `Aho-Corasick algorithm` here – jezrael Aug 24 '21 at 06:12 (a sketch of this approach follows these comments)
  • That isn't applicable here, as it looks for words within words, and here I believe Gavriel is just looking for whole words in the string (hence the use of `str.split()`) – Daniel Redgate Aug 24 '21 at 08:02
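For reference, a minimal sketch of the Aho-Corasick approach suggested in the comments above. It assumes the third-party `pyahocorasick` package (not part of the original post), and, as the last comment notes, it matches substrings rather than whole words, so its results can differ from the split-based approach:

# Minimal sketch only: assumes `pip install pyahocorasick`; finds substrings, not whole words.
import ahocorasick
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

automaton = ahocorasick.Automaton()
for word in observations_words_list:
    automaton.add_word(word, word)
automaton.make_automaton()

# True if at least one keyword occurs (as a substring) anywhere in the text.
has_keyword = df['text'].str.lower().map(lambda s: next(automaton.iter(s), None) is not None)
print(has_keyword.mean())  # 0.5 on the sample data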

3 Answers


You could try the following one-liner:

print(df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).sum(axis=1).mean())

Output:

0.5
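
For readers who want to see what the chained calls are doing, here is the same computation broken into steps on the sample data (no new behaviour, just the intermediates made visible):

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

# One word per column; rows from shorter sentences are padded with None.
words = df['text'].str.lower().str.split(expand=True)

# Boolean mask of keyword hits, then the number of hits per row.
hits_per_row = words.isin(set(observations_words_list)).sum(axis=1)
print(hits_per_row.tolist())  # [1, 0]
print(hits_per_row.mean())    # 0.5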
U13-Forward
  • tested `df = pd.concat([df] * 100000, ignore_index=True)` – jezrael Aug 24 '21 at 06:17
  • OP solution `448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:18
  • Your `846 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:18
  • Tested your improved `df['text'].str.lower().str.split(expand=True).isin(observations_words_list).sum(axis=1).mean()` – jezrael Aug 24 '21 at 06:18
  • @jezrael How about now – U13-Forward Aug 24 '21 at 06:19
  • `%%timeit df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).sum(axis=1).mean()`: `834 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:20
  • faster `10ms` only :( – jezrael Aug 24 '21 at 06:20
  • Wow. I'm impressed, @jezrael – goidelg Aug 24 '21 at 06:20
  • @U12-Forward Thanks for your answer!!!!! Super appreciated. Actually, I was about to implement it, as it seemed more intuitive. With that, if the other solution is twice as fast, I might go with that one. Really, super thanks! – goidelg Aug 24 '21 at 06:23
  • @GavrielGoidel You're welcome, would it be worth an upvote :) – U13-Forward Aug 24 '21 at 06:24
  • @U12-Forward, I was blinded by the speed test, but I'm concerned about the time I'll spend implementing an algorithm I don't understand. I'm considering your solution, which is much more straightforward. qq - I can see why to use `mean` if results are binary (sum of the 1s over count = mean), but when there is more than one keyword per record, results might be skewed, don't you think? – goidelg Aug 24 '21 at 06:40 (see the illustration after these comments)
  • @GavrielGoidel No, it will still work because of `sum`. – U13-Forward Aug 24 '21 at 06:41
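
For readers following the exchange above, a small illustration of the distinction being discussed; the third sentence is invented here purely to produce a row with several keyword hits:

import pandas as pd

obs = {"bent", "block", "broken"}
# The third sentence is hypothetical, added only to create multiple hits in one row.
texts = pd.Series(["merry had a little lamb and a broken limb",
                   "Little Jonathan found a chicken",
                   "a bent and broken block"])

counts = texts.apply(lambda x: len(set(x.lower().split()) & obs))
print(counts.tolist())      # [1, 0, 3]
print(counts.mean())        # 1.33... (pulled up by the multi-hit row)
print(counts.gt(0).mean())  # 0.67... = fraction of rows with at least one keyword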

There is a possible improvement using `.gt(0)` with `mean`:

df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 100000, ignore_index=True)


In [44]: %%timeit
    ...: df_len = len(df['text'])
    ...: obs = set(observations_words_list)
    ...: df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
    ...: obs_count = len(df[df['Observations'] > 0])/df_len
    ...: 
    ...: 
448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %%timeit
    ...: obs = set(observations_words_list)
    ...: df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).mean()
    ...: 
    ...: 
324 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael

290 ms on the same 100k-replicated df benchmark.

I understand you just want to see how many records contain one or more keywords, and the absolute count isn't relevant. In that case there is an even better (and in my opinion more elegant) one-line solution using regex, which doesn't iterate over the data with apply:

import re
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]*100000}
df = pd.DataFrame(data=d)

observations_words_list = ["bent","block","broken"]
obs = '|'.join(r"\b{}\b".format(x) for x in observations_words_list)
contains_obs = df['text'].str.contains(obs, flags=re.IGNORECASE, regex=True)
obs_count = sum(contains_obs)/len(df['text'])
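
One optional hardening of the pattern-building step (an addition, not part of the original answer): if any of the ~400 keywords could contain regex metacharacters, escape them with `re.escape` when building the alternation:

# re.escape guards against keywords containing regex metacharacters;
# it leaves plain words like those in the sample list unchanged.
obs = '|'.join(r"\b{}\b".format(re.escape(x)) for x in observations_words_list)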

If you wish to keep the number of observations, you can use `str.count` for a small time penalty (410 ms):

df['Observations'] = df['text'].str.count(obs, flags=re.IGNORECASE)
obs_count = sum(df['Observations']>0)/len(df['text'])
Daniel Redgate
  • Ya, seems the OP doesn't care about performance. – jezrael Aug 24 '21 at 08:16
  • Just a different search. However, I appreciated your insight, as I'll be using the Aho-Corasick algorithm myself in the future :) so giving you a plus one anyway. The algo unfortunately takes 5.5 seconds for 100k*df, but of course it is doing a lot more work on the strings... – Daniel Redgate Aug 24 '21 at 08:24
  • Btw, need `'|'.join(r"\b{}\b".format(x) for x in observations_words_list)` for word boundaries – jezrael Aug 24 '21 at 08:26
  • Just replaced `["bent","block","broken"]` with `observations_words_list` for good measure... can you accept the answer? – Daniel Redgate Aug 24 '21 at 08:30
  • You won, buddy. Thanks! – goidelg Aug 24 '21 at 11:58