
I have a pandas Series of ~350k rows, and I want to apply the pandas.Series.str.extract function with a regular expression consisting of an alternation of ~100 substrings, such as:

'(item0|item1|item2|item3|item4|item5|item6|item7|item8|item9|item10|item11|item12|item13|item14|item15|item16|item17|item18|item19|item20|item21|item22|item23|item24|item25|item26|item27|item28|item29|item30|item31|item32|item33|item34|item35|item36|item37|item38|item39|item40|item41|item42|item43|item44|item45|item46|item47|item48|item49|item50|item51|item52|item53|item54|item55|item56|item57|item58|item59|item60|item61|item62|item63|item64|item65|item66|item67|item68|item69|item70|item71|item72|item73|item74|item75|item76|item77|item78|item79|item80|item81|item82|item83|item84|item85|item86|item87|item88|item89|item90|item91|item92|item93|item94|item95|item96|item97|item98|item99|item100)'

The extract is too slow: it takes 1 minute in my Jupyter notebook (Python 3.9). Why is it so slow, and how can I speed it up?
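
For reference, a rough sketch of the kind of call involved (the Series contents and keyword pattern below are made up for illustration):

import pandas as pd

# Made-up stand-in data; the real Series has ~350k rows of free text
s = pd.Series(['some text mentioning item42', 'nothing relevant here'] * 175_000)
pattern = '(' + '|'.join(f'item{i}' for i in range(101)) + ')'
result = s.str.extract(pattern)  # this is the slow call in question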

Edit 1: I used 'itemX' as an example, but each alternative can be any substring. The regular expression could be something like

'(carrageenan|dihydro|basketball|etc...)'

Edit 2: In answer to some comments:

  • I'm looking for exact matches
  • I already precompile the regex using re.compile()
Brainless
  • Why not use pattern `r'(item\d+)'` ? – Andrej Kesely Jun 27 '21 at 18:02
  • I used 'itemX' as an example, but it can be any substring. The regular expression could be something like '(carrageenan|dihydro|basketball|etc...)' – Brainless Jun 27 '21 at 18:10
  • Does pre-compiling the regex pattern with `re.compile()` help? – SeaBean Jun 27 '21 at 19:01
  • @SeaBean [pandas does that](https://github.com/pandas-dev/pandas/blob/7c48ff4409c622c582c56a5702373f726de08e96/pandas/core/strings/accessor.py#L3041). – Mustafa Aydın Jun 27 '21 at 19:04
  • Can you provide a minimal working example? I am not able to reproduce your problem with random strings (of size 6-10), one pd.Series of 350k items and your provided regexp. It takes between 0.1 and 0.3 s on my machine (so about 300x faster...). – Jérôme Richard Jun 27 '21 at 19:10
  • @MustafaAydın Thanks for the info. The doc is about `str.extract()` and related methods. Does that mean all Pandas methods allowing regex do the same? How about `str.contains()`? – SeaBean Jun 27 '21 at 19:10
  • `str.contains` [does too](https://github.com/pandas-dev/pandas/blob/7c48ff4409c622c582c56a5702373f726de08e96/pandas/core/strings/object_array.py#L110). If we check others, they also probably do. @SeaBean – Mustafa Aydın Jun 27 '21 at 19:16
  • That's really nice! Much appreciated @MustafaAydın – SeaBean Jun 27 '21 at 19:20

1 Answer


In most cases, the problem with searching for many alternatives is that a lot of the search words share the same prefix; the more such words are in the list, the more backtracking steps the regex engine needs to find a match, which slows down execution.

A regex trie comes to the rescue here, together with word boundaries (since you need exact matches). Install the library with pip install trieregex and use:

from trieregex import TrieRegEx

keywords = ['item0', 'item1', 'item2', 'item3']  # your full list of ~100 substrings
tr = TrieRegEx(*keywords)
# tr.regex() produces an alternation with shared prefixes factored out
pattern = fr'\b({tr.regex()})\b'

Then, you can use the pattern with the .str.extract() method.
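
A minimal usage sketch, assuming `pattern` was built as above (the sample Series is made up):

import pandas as pd

s = pd.Series(['this row mentions item2', 'nothing to match here', 'and item3 is in this one'])
matches = s.str.extract(pattern)  # one column: the matched keyword, or NaN if none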

If you prefer not to use a third-party library to generate the regex trie, you can use the code from this SO post.
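
For a rough idea of how such code works, here is a simplified sketch (not the exact code from the linked post): insert every keyword into a nested-dict trie, then serialize the trie into a pattern in which shared prefixes are factored out, so the engine never backtracks over them.

import re

def trie_regex(words):
    # Build a nested-dict trie from the keywords
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True  # marks the end of a word

    def serialize(node):
        # A node that only marks the end of a word adds nothing to the pattern
        if '' in node and len(node) == 1:
            return ''
        end_here = '' in node
        branches = [re.escape(ch) + serialize(child)
                    for ch, child in sorted(node.items()) if ch != '']
        body = branches[0] if len(branches) == 1 else '(?:' + '|'.join(branches) + ')'
        # If a shorter keyword also ends at this node, the rest is optional
        return '(?:' + body + ')?' if end_here else body

    return serialize(trie)

pattern = r'\b(' + trie_regex(['item0', 'item1', 'item10']) + r')\b'
# pattern is now r'\b(item(?:0|1(?:0)?))\b'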

Wiktor Stribiżew
  • Interesting, but isn't it the purpose of the regex compilation/engine to do such work? I know that some engines generate a minimal deterministic automaton (like RE), and generally not the standard PCRE engine, but this simplification seems a basic one for a regex engine... – Jérôme Richard Jun 27 '21 at 19:33
  • @JérômeRichard The `re` regex engine does not do this under the hood. – Wiktor Stribiżew Jun 27 '21 at 20:53