1

Python re module does not have atomic grouping, it can be however emulated, for example see this Q&A. I was under the impression that using atomic grouping would be faster than using a simple alternation regex, due to the fact that it will not try all the alternatives in the group. But this does not hold, see below:

import re
import timeit
import random

random.seed(42)

words = ["tricky", "liquid", "sleepy", "crowded", "half", "secretary", "roll", "educate", "medical", "closed",
         "unaccountable", "earthy", "permit", "pleasant", "confuse", "enter", "land", "encourage", "connection",
         "mindless", "spicy",
         "cracker", "twist"]

atomic_group = re.compile(
    r"(?=(unaccountable|l(?:and|iquid)|half|t(?:ricky|wist)|roll|p(?:ermit|leasant)|s(?:picy|ecretary|leepy)|e(?:ducate|arthy|n(?:ter|courage))|m(?:indless|edical)|c(?:r(?:owded|acker)|on(?:fuse|nection)|losed)))\1")
non_atomic_group = re.compile(
    r"(?:unaccountable|l(?:and|iquid)|half|t(?:ricky|wist)|roll|p(?:ermit|leasant)|s(?:picy|ecretary|leepy)|e(?:ducate|arthy|n(?:ter|courage))|m(?:indless|edical)|c(?:r(?:owded|acker)|on(?:fuse|nection)|losed))")

sentence = " ".join(random.choices(words, k=10000))

print(timeit.timeit("atomic_group.findall(sentence)",
                    setup="from __main__ import atomic_group, sentence",
                    number=10))

print(timeit.timeit("non_atomic_group.findall(sentence)",
                    setup="from __main__ import non_atomic_group, sentence",
                    number=10))

Output

0.036498754999999994
0.028361783

The same behaviour is observed for larger datasets as is shown in the following plot: enter image description here

In the plot len(data) represents an increasing number of sentences (strings formed by 60 words). The code to reproduce it can be found here.

Is my assumption incorrect? On a more general note how can I write a regular expression (in Python) that will only try one of the branches in an alternation regex and none of the others?

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76

1 Answers1

1

Your assumption is not correct. The whole point of atomic patterns is to prevent backtracking into the pattern.

The atomic_group pattern is of (?=(...))\1 type in your code and the non-atomic one is of (?:...) type. So, the first one already loses to the second one due to the use of a capturing group, see capturing group VS non-capturing group.

Besides, you need to match the strings twice with the atomic_group pattern, first, with the lookahead, second, with the backreference.

So, only use this techinque when you need to control backtracking inside a longer pattern.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks! Is there a way to tell Python to only try one of the branches of the alternation? – Dani Mesejo Nov 04 '21 at 11:09
  • @DaniMesejo This sounds as if you want to avoid trying them, and in that case, why add them to the regex at all? I mean, in the current scenario, there is no need of an atomic grouping. Else, when you need to match and "fix" the match result in the current matching path, use this atomic thing. Also, remember that in NFA regex-directed engines the first alternative that matches "wins" and the rest of the alternatives are ignored. See [Remember That The Regex Engine Is Eager](https://www.regular-expressions.info/alternation.html#eager). This may be just enough for you. – Wiktor Stribiżew Nov 04 '21 at 11:11
  • Sorry I'll try to express myself better, imagine for example the regular expression: "u(abc|fgh)" and the string "uakde", this will fail to match the "b" in the first pattern and then it will try to match "fgh", right? What I want is regex that will not try to match "fgh" if it already partially matched other alternative, is kind of an exclusive or. – Dani Mesejo Nov 04 '21 at 11:19
  • @DaniMesejo There is no way to do that. Also, `u(?>abc|fgh|acle)` [matches](https://regex101.com/r/H6uzzd/1) `uacle` meaning that `fgh` is also tried even in an atomic group. – Wiktor Stribiżew Nov 04 '21 at 11:24