Drastically different timing on "equivalent" never-matching regex?

Question

I recently timed a bunch of regexes for the question "A Regex that will never be matched by anything" (my answer here, see for more information).

However, after my testing I noticed that the regex 'a^' and 'x^' took drastically different amounts of time to check, although they should be identical. (It was only by chance that I even switched the character.) These timings are below.

In [1]: import re

In [2]: with open('/tmp/longfile.txt') as f:
   ...:     longfile = f.read()
   ...:     

In [3]: len(re.findall('\n',longfile))
Out[3]: 275000

In [4]: len(longfile)
Out[4]: 24733175

...

In [45]: %timeit re.search('x^',longfile)
6.89 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %timeit re.search('a^',longfile)
37.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %timeit re.search(' ^',longfile)
49.8 ms ± 844 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Online testing (with just the first 50 lines) shows the same behavior (1441880 steps and ~710ms vs only 40858 steps and ~113ms): https://regex101.com/r/AwaHmK/1

What is Python doing here that makes 'a^' take so much longer than 'x^'?

Just to see if there was something going on inside timeit or IPython, I wrote a simple timing function myself, and everything checks out:

In [57]: import time

In [59]: import numpy as np

In [62]: def timing(regex,N=7,n=100):
    ...:     tN = []
    ...:     for i in range(N):
    ...:         t0 = time.time()
    ...:         for j in range(n):
    ...:             re.search(regex,longfile)
    ...:         t1 = time.time()
    ...:         tN.append((t1-t0)/n)
    ...:     return np.mean(tN)*1000, np.std(tN)*1000
    ...: 

In [63]: timing('a^')
Out[63]: (37.414282049451558, 0.33898056279589844)

In [64]: timing('x^')
Out[64]: (7.2061508042471756, 0.22062989840321218)

I also replicated my results outside of IPython, in a standard 3.5.2 shell. So the oddity is not constrained to either IPython or timeit.

wim · Accepted Answer · 2018-05-29T15:51:34.023

4

As mentioned in the linked question, this regex scans the whole text.

The timing discrepancies seen here are just because "a" is such a common letter in English text, and you used readable data. So, if you study how regex engines work, you will understand: using the pattern a^ causes many more delays due to finding the tentative matches on the first "a", which then get rejected later. The engine has two "reading heads" which both move from left to right - one moves in the string, the other moves in the regex pattern. Using the a^ pattern in combination with your choice of human-readable data, the regex engine has to do more work. Since "x" is an uncommon letter in the corpus, using the pattern x^ wastes less time - more positions in the text can be rejected immediately.

If you start the pattern with another common letter, such as "e", it will be just as slow (using e^ will be even slower than a^, because "e" is more frequently appearing in English).
If you use random ascii bytes for the corpus, instead of real text, both a^ and x^ patterns will perform similarly.

In summary, these two "equivalent" never-matching regex patterns a^ and x^ are actually not so equivalent when taking into account the wider context of the regex engine's inner workings and the chosen test data.

edited May 29 '18 at 15:51

answered Nov 05 '17 at 05:00

wim

338,267
99
616
750

Please read my answer to the linked question -- it clearly demonstrates that in python's regex implementation, the "inefficient" `'x^'` is substantially faster than the other proposed solutions, whereas @arantius found the contrary for `egrep`. – nivk Nov 05 '17 at 05:11
I did read it. And indeed I can reproduce the result that the "ineffecient" `'x^'` is faster than using the lookaround. It's interesting, but that's not what your question here is asking about - and it doesn't really change my answer. – wim Nov 05 '17 at 05:18
wim's right. Provided subject string contains 8 occurrences of `x` but 310 occurrences of `a`. That's something. @nivk – revo Nov 05 '17 at 05:28
Oh I wasn't implying that s/he wasn't! I was just wondering what the first part had to do with the question at hand (aside from being not quite accurate). The letter popularity explanation is very interesting! This may require a follow-up question, but does this also hold for `egrep`? – nivk Nov 05 '17 at 05:34
2

I wanna [share an answer of mine with you](https://stackoverflow.com/a/39932775/1020526). It doesn neither talk about python regex engine specifically nor expressions you provided but it shows how different it could be when engine does some optimization checks behind the scenes. @nivk – revo Nov 05 '17 at 05:37

Drastically different timing on "equivalent" never-matching regex?

1 Answers1

Linked