
I have a database with over 2 million rows. I'm trying to find rows that contain both of two words, using a regex like:

df1 = df[df['my_column'].str.contains(r'(?=.*first_word)(?=.*second_word)')]

However, when I try to run this in a Jupyter notebook, it either takes over a minute to return the rows or it crashes the kernel and I have to try again.

Is there a more efficient way for me to return rows in a dataframe that contain both words?

Toms Code
  • You can look into Dask, to parallelize the read https://stackoverflow.com/questions/42550529/dask-how-would-i-parallelize-my-code-with-dask-delayed – Himanshu Kandpal Jan 27 '21 at 14:18
  • Thanks hkandpal, I'm not familiar with Dask. I'll check it out – Toms Code Jan 27 '21 at 16:20
  • maybe first you could try simpler regex - ie. `contains("first_word.*second_word")` - and it may need less time to search it. – furas Jan 27 '21 at 17:11
  • Hi furas, thank you. I just tried the above query but it returned 0 results, is there something I'm missing from it? – Toms Code Jan 27 '21 at 17:34
  • I checked how your regex actually works, and it seems it would have to be `"first_word.*second_word|second_word.*first_word"`. It could also use `?` to make it lazy so it checks fewer characters: `"first_word.*?second_word|second_word.*?first_word"`. But I don't have big data to test whether it will be faster. – furas Jan 27 '21 at 20:11
  • What about `df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']))`? – Ryszard Czech Jan 27 '21 at 22:55
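
The regex alternatives discussed in the comments above would look roughly like this (a sketch only, assuming my_column holds plain strings; na=False skips missing values, and the alternation is needed because the words can appear in either order):

df1 = df[df['my_column'].str.contains(r'first_word.*?second_word|second_word.*?first_word', na=False)]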

1 Answer


Use

df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']) )

It checks that every word in the list is present in the my_column column, without an awkward regex.
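
A minimal sketch of how this plugs into the original filtering step (toy data stands in for the real 2-million-row frame; the column and word names come from the question):

import pandas as pd

# Small example frame with the question's column name.
df = pd.DataFrame({'my_column': ['first_word then second_word',
                                 'only first_word here',
                                 'second_word before first_word']})

words = ['first_word', 'second_word']
# Plain substring checks on each cell; no regex engine involved.
mask = df['my_column'].apply(lambda x: all(w in x for w in words))
df1 = df[mask]

Because this skips the regex engine and the lookaheads entirely, it is usually noticeably faster on a large column, although apply still loops in Python under the hood.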

Ryszard Czech