
Consider this Python code:

import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')
def two():
    r.search(mystring)

mystring = "hello" * 1000    # one of the substrings is present
print([timeit.timeit(k, number=10000) for k in (one, two)])
mystring = "goodbye" * 1000  # none of the substrings are present
print([timeit.timeit(k, number=10000) for k in (one, two)])

Basically, I'm benchmarking two ways to check whether any of several substrings is present in a large string.

What I get here (Python 3.2.3) is this output:

[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]

In the first case, the regular expression easily defeats the `any` expression - the regular expression finds the substring immediately, while `any` has to scan the whole string twice (for 'foo' and 'bar') before it gets to the correct substring.

But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the `any` expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?

cha0site
  • At least I can confirm the timing, but I have no clue why. +1 for oddity & copy'n'pasteable example. – Jonas Schäfer Jun 25 '12 at 14:08
  • @JBernardo: You mean instead of the tuple in `one()`? That makes no difference, actually. Removing the group in the regex by using "`foo|bar|hello`" makes the regex take 3.5s instead of 3.7s, but that's still really slow. – cha0site Jun 25 '12 at 14:15
  • If you use longer strings (`'foo'*1000`) and exclude simple length test (change `hello*1000` to `hello*200`), you'll see that regexp is generally slower, not just in the second example. – hamstergene Jun 25 '12 at 14:23
  • A guess, maybe the simple string search is optimized so it compares more than one byte per assembly instruction (using some SIMD instructions?), while the regex is not. – Qtax Jun 25 '12 at 15:26
  • @hamster: `'foo'*1000` is not a very interesting case, because the `any` expression short-circuits immediately. And the regex is still faster in that case, at least here - they're about the same though, as you'd expect. – cha0site Jun 25 '12 at 15:36
  • @Qtax: It's worse, actually, check my answer. – cha0site Jun 25 '12 at 15:36

4 Answers


Note to future readers

I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexps I have in the question.
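
For illustration, here is a minimal sketch (an added example, not part of the original benchmark) comparing plain substring search with the regex engine for a single literal needle; the `in` operator goes through the optimized substring search in CPython's stringlib, while `re` goes through the general-purpose matching engine:

import re
import timeit

haystack = "goodbye" * 1000   # the substring is absent, as in the slow case
pattern = re.compile("hello")

# Both have to scan the whole string before giving up, but plain substring
# search is considerably faster than the regex engine here.
print(timeit.timeit(lambda: "hello" in haystack, number=10000))
print(timeit.timeit(lambda: pattern.search(haystack), number=10000))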

Original Answer

Apparently this is not a random fluke - Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.

It uses the backtracking approach even when there are no back references in the regular expression!

What this means is that in the worst case, Python regexes take exponential, and not linear, time!

This is a very detailed paper describing the issue: http://swtch.com/~rsc/regexp/regexp1.html

I think the graph near the end of that article, which plots matching time against string length for various regular expression implementations, summarizes it succinctly.
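
To see the worst case concretely, here is a sketch of the pathological pattern described in the paper (not the regex from the question): n optional a's followed by n required a's, matched against a string of exactly n a's. A backtracking engine has to try on the order of 2^n ways of distributing the a's before it finds the match, so the time roughly doubles with every step:

import re
import timeit

# Pathological case from the paper: 'a?'*n + 'a'*n matched against 'a'*n.
# A backtracking engine explores exponentially many paths; a DFA-based
# engine would handle the same input in linear time.
for n in range(1, 22):
    pattern = re.compile('a?' * n + 'a' * n)
    text = 'a' * n
    print(n, timeit.timeit(lambda: pattern.match(text), number=1))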

cha0site
  • Very true. I've seen naive string search algorithms outperform an equivalent regex by more than an order of magnitude on several occasions here on SO. It's not only a Python issue, and that's a pity, seeing that an optimizing engine could make these kinds of regexes blazingly fast. – Niklas B. Jun 25 '12 at 15:39
  • Be aware, though, that the exponential runtime only happens for very, very specific cases (the paper calls them pathological) – Niklas B. Jun 25 '12 at 15:48
  • @Niklas: The paper argues that optimizers are part of the problem, not the solution. Also, of course you're right about this being a pathological case. I've edited the answer to mention that I'm talking about the worst case. – cha0site Jun 25 '12 at 15:52
  • The backtracking behavior of the regex matching algorithm doesn't really matter in this case, we never back up in the searched string (ignore the `b`). Compare searching the string for `x` and `y` with string functions vs `[xy]` regex. Each byte is compared two times, the regex never backtracks, but is still much slower. – Qtax Jun 25 '12 at 15:55
  • @cha0site: The problem is that modern regex implementations have much more power than the classical formal regular language expressions. If back references are involved, we can no longer construct a NFA to the regex, so we can't use the blazing fast DFA-based approach. I think optimization on a case-by-case base is the way to go. Of course if it's possible, a good optimization could be to construct a minimal DFA. **EDIT:** Just saw the relevant part of the paper: "A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences." – Niklas B. Jun 25 '12 at 15:58
  • @Qtax: I guess in this particular case, the string functions are just a lot faster than the regex engine's internal algorithms. – Niklas B. Jun 25 '12 at 16:00
  • @NiklasB., btw it is possible to construct NFA for expressions with backreferences, as long as the backrefs have highly limited number of values. ;-) http://stackoverflow.com/questions/11101405/how-do-backreferences-in-regexes-make-backtracking-required – Qtax Jun 25 '12 at 16:02
  • @Qtax: Hm, okay. What I meant was that if backtracking is involved, the NFA's get too complex to be executed efficiently, let alone being transformed into a DFA. – Niklas B. Jun 25 '12 at 16:05
  • Niklas: Yes, I'm referring to clever implementations ^__^. That's what I meant when I said that `re` uses backtracking even in the no-backreferences case. – cha0site Jun 25 '12 at 16:09
  • @cha0site: Yeah, but that's not really correct in this particular case. It just doesn't use fast string search algorithms. – Niklas B. Jun 25 '12 at 16:12
  • @Qtax: I tested the `[xy]` code, it is slower, but it's not slower by an order of magnitude anymore. Also, note that `time grep 'foo|bar|hello' < testfile` where there are 10000 lines in the test file is about 0.1s here. I'd expect super optimized library code like `re` to be maybe 5x that... Not 37x. – cha0site Jun 25 '12 at 16:12
  • @Niklas: Perhaps I misunderstood then. Is the problem here not that `re` is implemented using backtracking, instead of the better algorithm explained in the paper? – cha0site Jun 25 '12 at 16:15
  • @cha0site: Backtracking is probably used for regexes like `a.*b` or even `ab?`, but I'm not sure if it's used for alternation groups. – Niklas B. Jun 25 '12 at 16:18
  • @cha0site, regex/string funct: `foo|bar|baz`/(`foo`, `bar`, `baz`) = 4.14, `[xyz]`/(`x`, `y`, `z`) = 4.79, on my system. Almost the same speed differences, so backtracking it self is not the issue here. Just seems that the string functions are better optimized in this case. – Qtax Jun 25 '12 at 16:30
  • @Qtax: You're right, and playing around with `n`, the growth does seem pretty linear. So I'm back to being confused. – cha0site Jun 25 '12 at 16:45
  • @cha0site, I want to write a simple test in C to compare iterating the string twice looking for `*p == c` or once looking for `*p == c1 || *p == c2`, to see if any CPU magic involved. And also compare them to the standard `strstr()`/`strchr()` to see if those are much faster. But no C compiler here atm. – Qtax Jun 25 '12 at 16:55
  • @Qtax: [This](http://svn.python.org/view/python/tags/r313/Objects/stringlib/fastsearch.h?revision=86835&view=markup) is the C code that ends up being called for the `in` operator in strings. Looks intense. – cha0site Jun 25 '12 at 17:13
  • @cha0site, cool, altho the single char code boils down to `for (i = 0; i < n; i++) if (s[i] == p[0]) return i;` – Qtax Jun 25 '12 at 17:44
  • @Qtax: Of course, in the case of single-char patterns this is the fastest way. For the strings in the question, though, Boyer-Moore is used. In general, string operations in Python are *very* fast. – Niklas B. Jun 26 '12 at 12:15
  • I think the regular expression module must be quite different between Python 2.7 and 3.2. I have a little program that does heavy regular expression processing, and it is > 16x slower under Python 2.7.2 than under Python 3.2.2. So when benchmarking these things, don't neglect testing under both versions. – Dave Burton Jul 09 '12 at 13:14

My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper. It takes a bit of work to get installed on some systems.

I was having the same issue with some complex regexes and long strings -- re2 sped the processing time up significantly -- from seconds to milliseconds.
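
For reference, a rough sketch of what using it can look like (this assumes the pyre2 wrapper, which aims to be a drop-in replacement for the standard re module; check the wrapper's documentation for the exact package name and install steps):

import re2  # Python wrapper around Google's RE2; installed separately

# RE2 compiles the pattern to an automaton instead of backtracking, so
# search time stays linear in the length of the input string.
pattern = re2.compile('foo|bar|hello')
print(pattern.search('goodbye' * 1000))  # None, but without the slowdown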

Annie B

The reason the regex is so slow is that it not only has to go through the whole string, but it also has to perform several comparisons at every character.

The first one simply does this:

Does f match h? No.
Does b match h? No.
Does h match h? Yes.
Does e match e? Yes.
Does l match l? Yes.
Does l match l? Yes.
Does o match o? Yes.
Done. Match found.

The second one does this:

Does f match g? No.
Does b match g? No.
Does h match g? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match d? No.
Does b match d? No.
Does h match d? No.
Does f match b? No.
Does b match b? Yes.
Does a match y? No.
Does h match b? No.
Does f match y? No.
Does b match y? No.
Does h match y? No.
Does f match e? No.
Does b match e? No.
Does h match e? No.
... 999 more times ...
Done. No match found.

I can only speculate about the difference between `any` and the regex, but I'm guessing the regex is slower mostly because it runs in a highly complex, general-purpose engine; with all the state-machine machinery involved, it just isn't as efficient as the specialized implementation behind `in`.

In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.

In the second string, however, `any` performs essentially the same steps as the regex, just in a different order. That it still comes out ahead suggests the `any` solution is faster simply because it is simpler.

Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.
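
One way to test that claim is a quick sketch like this (an added example with illustrative test strings, not the ones from the question; exact numbers will vary by Python version):

import re
import timeit

pattern = re.compile('foo|bar|hello')
needles = ('foo', 'bar', 'hello')

# Needle near the start, near the end, or absent entirely.
cases = {
    'start':  'hello' + 'x' * 10000,
    'end':    'x' * 10000 + 'hello',
    'absent': 'x' * 10000,
}

for name, s in cases.items():
    t_any = timeit.timeit(lambda: any(n in s for n in needles), number=1000)
    t_re = timeit.timeit(lambda: pattern.search(s), number=1000)
    print(name, t_any, t_re)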

Disclaimer: I don't know Python. I know algorithms.

Kendall Frey
  • Sure, but that's 3 operations per character in the string. The `any` goes over the string 3 times. I expected them to be about the same in this case, but the regex is slower by an order of magnitude... – cha0site Jun 25 '12 at 14:20
  • @cha0site: See the updated answer. The difference seems quite large, and I would attribute it to the inefficiency of the regex engine. – Kendall Frey Jun 25 '12 at 14:35
  • I think I found the answer - it's not random inefficiency, Python regexs use a slow algorithm. I'm writing it up for an answer. – cha0site Jun 25 '12 at 15:20

You have a regexp that is made up of three regexps. Exactly how do you think that works if the regexp doesn't check all three? :-) There's no magic in computing; you still have to do three checks.

But the regexp will do all three tests character by character, while the "one()" method will check the whole string for one substring before going on to the next one.

The regexp is much faster in the first case because the string that actually matches is listed last. That means one() first needs to look through the whole string for "foo", then for "bar", and only then for "hello", where it matches. Move "hello" first, and one() and two() are almost the same speed, as the first check done in both cases succeeds.
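
A sketch of that reordering (an added example, using the first test string from the question):

import timeit

mystring = "hello" * 1000

# With 'hello' listed first, one() short-circuits on its first pass instead of
# scanning the whole string for 'foo' and 'bar' before ever trying 'hello'.
original = timeit.timeit(lambda: any(s in mystring for s in ('foo', 'bar', 'hello')), number=10000)
reordered = timeit.timeit(lambda: any(s in mystring for s in ('hello', 'foo', 'bar')), number=10000)
print(original, reordered)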

Regexps are much more complex tests than "in", so I'd expect them to be slower. I suspect that this complexity increases a lot when you use "|", but I haven't read the source of the regexp library, so what do I know. :-)

Lennart Regebro
  • There is a lot of magic in computing. For example, the [Aho-Corasick algorithm](http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm) would solve this problem quicker; the regexp algorithm could incorporate this. – unkulunkulu Jun 25 '12 at 14:36
  • Actually, no. Regular expressions are basically a domain-specific language that describes a deterministic finite automaton (DFA) -- well, Python regexes can have backreferences, which breaks that, but my regex does not have any backreferences. Checking a string against such an automaton is supposed to be O(m*n), where m is the size of the automaton and n is the length of the string. This isn't magic, this is theory. Additionally, regexes are extremely popular and well understood. I expect them to be optimized - Python's re module is written in C for a reason. – cha0site Jun 25 '12 at 15:19
  • @cha0site: Optimization doesn't remove the necessity of making three checks when checking for three different things. – Lennart Regebro Jun 25 '12 at 21:41
  • @unkulunkulu: The Aho-Corasick algorithm doesn't help much here, because the overlap between the three strings is very small. Hence you still really have three different things to check. But yes, a better optimized engine should be more or less as fast in this case, while Python's re engine is way slower in this case. And calling it "magic" is silly. – Lennart Regebro Jun 25 '12 at 21:43
  • @LennartRegebro if you use a DFA based regex algorithm, you really only need to do one operation per character in the input string. There's only one possible DFA node you can be in at one time, and you perform one node-transition per input character. – Erik Haliewicz Sep 05 '18 at 21:16