
I have a database with over 2 million rows. I'm trying to find rows that contain both of two words, using a regex like:

df1 = df[df['my_column'].str.contains(r'(?=.*first_word)(?=.*second_word)')]

However, when I try to run this in a Jupyter notebook, it either takes over a minute to return the rows or it crashes the kernel and I have to try again.

Is there a more efficient way for me to return rows in a dataframe that contain both words?

Toms Code
  • You can look into Dask, to parallelize the read https://stackoverflow.com/questions/42550529/dask-how-would-i-parallelize-my-code-with-dask-delayed – Himanshu Kandpal Jan 27 '21 at 14:18
  • Thanks hkandpal, I'm not familiar with Dask. I'll check it out – Toms Code Jan 27 '21 at 16:20
  • maybe first you could try simpler regex - ie. `contains("first_word.*second_word")` - and it may need less time to search it. – furas Jan 27 '21 at 17:11
  • Hi furas, thank you. I just tried the above query but it returned 0 results, is there something I'm missing from it? – Toms Code Jan 27 '21 at 17:34
  • I checked how your regex actually works, and it seems it would have to be `"first_word.*second_word|second_word.*first_word"`. It could also use `?` to make it lazy so it checks fewer characters: `"first_word.*?second_word|second_word.*?first_word"`. But I don't have big data to test whether it will be faster. – furas Jan 27 '21 at 20:11
  • What about `df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']))`? – Ryszard Czech Jan 27 '21 at 22:55
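
The regex alternatives discussed in the comments above would look roughly like this (a sketch only, assuming my_column holds plain strings; na=False skips missing values, and the alternation is needed because the words can appear in either order):

df1 = df[df['my_column'].str.contains(r'first_word.*?second_word|second_word.*?first_word', na=False)]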

1 Answer


Use

df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']) )

It checks that every word in the list is present in the my_column column, without an awkward regex.
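
A minimal sketch of how this plugs into the original filtering step (toy data stands in for the real 2-million-row frame; the column and word names come from the question):

import pandas as pd

# Small example frame with the question's column name.
df = pd.DataFrame({'my_column': ['first_word then second_word',
                                 'only first_word here',
                                 'second_word before first_word']})

words = ['first_word', 'second_word']
# Plain substring checks on each cell; no regex engine involved.
mask = df['my_column'].apply(lambda x: all(w in x for w in words))
df1 = df[mask]

Because this skips the regex engine and the lookaheads entirely, it is usually noticeably faster on a large column, although apply still loops in Python under the hood.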

Ryszard Czech