
I have a df with a text column. Say:

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)

I also have a list with ~400 keywords, for example:

observations_words_list = ["bent","block","broken"]

I wish to see how many records have one or more keywords in their text, and I'm doing it like this:

    df = df['text'].dropna().reset_index()
    df_len = len(df['text'])
    obs = set(observations_words_list)
    df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
    obs_count = len(df[df['Observations'] > 0])/df_len

For the sample df (in reality I read a CSV with ~0.5M records), I expect the new column to hold 1 for the first record, 0 for the second, and overall obs_count = 0.5.

Runtime is far from ideal, and I'm looking for a faster way to process this step.

Would love your ideas. Thanks!

goidelg
  • Can you check [this](https://stackoverflow.com/questions/48541444/pandas-filtering-for-multiple-substrings-in-series)? – jezrael Aug 24 '21 at 06:10
  • Solution is `Aho-Corasick algorithm` here – jezrael Aug 24 '21 at 06:12 (a sketch of this approach follows these comments)
  • That isn't applicable here, as it looks for words within words, and here I believe Gavriel is just looking for whole words in the string (hence the use of `str.split()`) – Daniel Redgate Aug 24 '21 at 08:02
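For reference, a minimal sketch of the Aho-Corasick approach suggested in the comments above. It assumes the third-party `pyahocorasick` package (not part of the original post), and, as the last comment notes, it matches substrings rather than whole words, so its results can differ from the split-based approach:

# Minimal sketch only: assumes `pip install pyahocorasick`; finds substrings, not whole words.
import ahocorasick
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

automaton = ahocorasick.Automaton()
for word in observations_words_list:
    automaton.add_word(word, word)
automaton.make_automaton()

# True if at least one keyword occurs (as a substring) anywhere in the text.
has_keyword = df['text'].str.lower().map(lambda s: next(automaton.iter(s), None) is not None)
print(has_keyword.mean())  # 0.5 on the sample data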

3 Answers


You could try the following one-liner:

print(df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).sum(axis=1).mean())

Output:

0.5
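
For readers who want to see what the chained calls are doing, here is the same computation broken into steps on the sample data (no new behaviour, just the intermediates made visible):

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

# One word per column; rows from shorter sentences are padded with None.
words = df['text'].str.lower().str.split(expand=True)

# Boolean mask of keyword hits, then the number of hits per row.
hits_per_row = words.isin(set(observations_words_list)).sum(axis=1)
print(hits_per_row.tolist())  # [1, 0]
print(hits_per_row.mean())    # 0.5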
U13-Forward
  • tested `df = pd.concat([df] * 100000, ignore_index=True)` – jezrael Aug 24 '21 at 06:17
  • OP solution `448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:18
  • Your `846 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:18
  • Tested your improved `df['text'].str.lower().str.split(expand=True).isin(observations_words_list).sum(axis=1).mean()` – jezrael Aug 24 '21 at 06:18
  • @jezrael How about now – U13-Forward Aug 24 '21 at 06:19
  • `%%timeit df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).sum(axis=1).mean()`: `834 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Aug 24 '21 at 06:20
  • faster `10ms` only :( – jezrael Aug 24 '21 at 06:20
  • Wow. I'm impressed, @jezrael – goidelg Aug 24 '21 at 06:20
  • @U12-Forward Thanks for your answer!!!!! Super appreciated. Actually, I was about to implement it, as it seemed more intuitive. With that, if the other solution is twice as fast, I might go with that one. Really, super thanks! – goidelg Aug 24 '21 at 06:23
  • @GavrielGoidel You're welcome, would it be worth an upvote :) – U13-Forward Aug 24 '21 at 06:24
  • @U12-Forward, I was blinded by the speed test, but I'm concerned about the time I'll spend implementing an algorithm I don't understand. I'm considering your solution, which is much more straightforward. qq - I can see why to use `mean` if results are binary (sum of the 1s over count = mean), but when there is more than one keyword per record, results might be skewed, don't you think? – goidelg Aug 24 '21 at 06:40 (see the illustration after these comments)
  • @GavrielGoidel No, it will still work because of `sum`. – U13-Forward Aug 24 '21 at 06:41
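
For readers following the exchange above, a small illustration of the distinction being discussed; the third sentence is invented here purely to produce a row with several keyword hits:

import pandas as pd

obs = {"bent", "block", "broken"}
# The third sentence is hypothetical, added only to create multiple hits in one row.
texts = pd.Series(["merry had a little lamb and a broken limb",
                   "Little Jonathan found a chicken",
                   "a bent and broken block"])

counts = texts.apply(lambda x: len(set(x.lower().split()) & obs))
print(counts.tolist())      # [1, 0, 3]
print(counts.mean())        # 1.33... (pulled up by the multi-hit row)
print(counts.gt(0).mean())  # 0.67... = fraction of rows with at least one keyword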

There is a possible improvement using `.gt(0)` with `mean`:

df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 100000, ignore_index=True)


In [44]: %%timeit
    ...: df_len = len(df['text'])
    ...: obs = set(observations_words_list)
    ...: df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
    ...: obs_count = len(df[df['Observations'] > 0])/df_len
    ...: 
    ...: 
448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %%timeit
    ...: obs = set(observations_words_list)
    ...: df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).mean()
    ...: 
    ...: 
324 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael

290 ms on the same 100k-replicated df benchmark.

I understand you just want to see how many records contain one or more keywords, and the absolute count isn't relevant. In that case there is an even better (and in my opinion more elegant) one-line solution using regex, which doesn't iterate over the data with apply:

import re
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]*100000}
df = pd.DataFrame(data=d)

observations_words_list = ["bent","block","broken"]
obs = '|'.join(r"\b{}\b".format(x) for x in observations_words_list)
contains_obs = df['text'].str.contains(obs, flags=re.IGNORECASE, regex=True)
obs_count = sum(contains_obs)/len(df['text'])
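
One optional hardening of the pattern-building step (an addition, not part of the original answer): if any of the ~400 keywords could contain regex metacharacters, escape them with `re.escape` when building the alternation:

# re.escape guards against keywords containing regex metacharacters;
# it leaves plain words like those in the sample list unchanged.
obs = '|'.join(r"\b{}\b".format(re.escape(x)) for x in observations_words_list)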

If you wish to keep the number of observations, you can use `str.count` for a small time penalty (410 ms):

df['Observations'] = df['text'].str.count(obs, flags=re.IGNORECASE)
obs_count = sum(df['Observations']>0)/len(df['text'])
Daniel Redgate
  • Ya, seems the OP doesn't care about performance. – jezrael Aug 24 '21 at 08:16
  • Just a different search. However, I appreciated your insight, as I'll be using the Aho-Corasick algorithm myself in the future :) so giving you a plus one anyway. The algo unfortunately takes 5.5 seconds for 100k*df, but of course it is doing a lot more work on the strings... – Daniel Redgate Aug 24 '21 at 08:24
  • Btw, need `'|'.join(r"\b{}\b".format(x) for x in observations_words_list)` for word boundaries – jezrael Aug 24 '21 at 08:26
  • Just replaced `["bent","block","broken"]` with `observations_words_list` for good measure... can you accept the answer? – Daniel Redgate Aug 24 '21 at 08:30
  • You won, buddy. Thanks! – goidelg Aug 24 '21 at 11:58