21

I have an application code which generates regexes dynamically from a config for some parsing. When timing performance of two variations, the regex variation with each part of an OR regex being captured is noticably slow than a normal regex. The reason would be overhead of certain operations internally in regex module.

>>> import timeit
>>> setup = '''
... import re
... '''   

#no capture group 
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

#with capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

#no capture group
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301

Question: What causes this considerable drop in performance when using capture groups ?

DhruvPathak
  • 42,059
  • 16
  • 116
  • 175
  • 2
    You already wrote that *the reason would be overhead of certain operations internally in regex module.* What is it you are interested in? What happens if there are capturing groups in the pattern? – Wiktor Stribiżew Jan 03 '17 at 13:43
  • 1
    You specify *in Python* - are you saying that the Python `re` module is significantly slower than the RE engine in another language (using the same RE and data)? If you are, could you show the comparisons you have made please? – cdarke Jan 03 '17 at 14:06
  • @WiktorStribiżew Yes, interested in what is internally causing this. – DhruvPathak Jan 03 '17 at 14:44

2 Answers2

22

The reason is pretty simple, using capturing groups indicate the Engine to save the content in memory, while using non capturing group indicates the engine to not save anything. Consider that you are telling the engine to perform more operations.

For instance, using this regex (hello|bye|ola|cheers) or (hello)|(bye)|(ola)|(cheers) will impact considerably higher than using an atomic group or a non capturing one like (?:hello|bye|ola|cheers).

When using regex you know if you want to capture or not capture content like the case above. If you want to capture any of those words, you will lose performance but if you don't need to capture content then you can save performance by improving it like using non-capturing groups

I know you tagged python, but have have prepared an online benchmark for javascript to show how capturing and non-capturing groups impacts in the js regex engine.

https://jsperf.com/capturing-groups-vs-non-capturing-groups

enter image description here

Federico Piazza
  • 30,085
  • 15
  • 87
  • 123
  • Those differences are far smaller than in OP's (admittedly limited) test. – Jongware Jan 03 '17 at 14:10
  • @RadLexus, yes, it called my attention too. I don't know how python regex engine performs with capturing group and alternations. I like jsperf because it helps doing benchmarks pretty easily, but it's js instead of python. Anyway, my idea was to show that using capturing groups is more expensive than using non-capturing groups or no groups at all. – Federico Piazza Jan 03 '17 at 14:15
  • Unless Python's regex is extremely more inefficient than Javascript's, these differences are small enough to be considered not significant .. I'm leaning towards internal object managing inside Python. The fact that one regex creates more objects than another is then not the "cause" - another solution without regex but *with* the same number of additional objects would also be slower. – Jongware Jan 03 '17 at 14:21
  • 2
    @RadLexus, I agree. In js regex, the impact of using capturing groups and non capturing groups is around 20% (it's a considerably impact imho), but on Python seems to be around 40%, so as you said memory/object management could be the culprit of the remaining 20%. – Federico Piazza Jan 03 '17 at 14:31
12

Your patterns only differ in the capturing groups. When you define a capturing group in the regex pattern and use the pattern with re.search, the result will be a MatchObject instance. Each match object will contain as many groups as there are capturing groups in the pattern, even if they are empty. That is the overhead for the re internals: adding the (list of) groups (memory allocation, etc.). Mind that groups also contain such details as the starting and ending index of the text that they match and more (refer to the MatchObject reference).

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Why is this `(?:hello|bye|ola|cheers)` faster than this: `hello|bye|ola|cheers` ? – Mohammad Yusuf Jan 03 '17 at 14:06
  • Do you mean the pattern is standalone? Functionally, they are the same, as only Group 0 (the whole match) is created when both are run. – Wiktor Stribiżew Jan 03 '17 at 14:10
  • Tested on `string` in question with `re.search`. The former is slightly faster. – Mohammad Yusuf Jan 03 '17 at 14:12
  • 4
    I do not think that using *redundant* non-capturing groups (as you used here) will eventually yield better performance. I remember of quite opposite results (in PHP with PCRE). The point is that the more capturing groups, the less efficient the pattern is, so the rule is to get rid of any *redundant capturing groups*. The rest is already nano-optimization that you may skip. Note that non-capturing groups are considered less readable by the majority of users. – Wiktor Stribiżew Jan 03 '17 at 14:25