2

I have the following two blocks of code:

import re
import time

def replace_re(text):
    start = time.time()
    new_text = re.compile(r'(\n|\s{4})').sub('', text)
    finish = time.time()
    return finish - start

def replace_builtin(text):
    start = time.time()
    new_text = text.replace('\n', '').replace('    ', '')
    finish = time.time()
    return finish - start

Then I call both functions with a text parameter (~500 KB of the source code of one web page). I thought replace_re() would be much faster, but the results are as follows:

  1. replace_builtin() ~ 0.008 sec
  2. replace_re() ~ 0.035 sec (nearly 4.5 times slower!!!)

Why is that?
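
For reference, a sketch of the same comparison done with timeit, which averages over repeated runs (sample_text here is only a stand-in for the real page source, so the absolute numbers will differ):

import re
import timeit

# Stand-in for the ~500 KB page source used in the real measurement.
sample_text = '<div>\n    <p>hello world</p>\n</div>\n' * 10000

pattern = re.compile(r'(\n|\s{4})')

re_time = timeit.timeit(lambda: pattern.sub('', sample_text), number=100)
builtin_time = timeit.timeit(
    lambda: sample_text.replace('\n', '').replace('    ', ''), number=100)

print('re.sub:      %.4f sec per call' % (re_time / 100))
print('str.replace: %.4f sec per call' % (builtin_time / 100))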

Vitalii Ponomar
  • Why wouldn't the regex be slower? A regular expression engine is much more expensive than iterating over a string and checking for matches. – Xophmeister Jun 29 '12 at 08:36
  • +1 for profiling, and for showing your code and results. It's a legitimate question, even if many responses are treating it as common knowledge. – Todd A. Jacobs Jun 29 '12 at 08:49

4 Answers

7

Because regular expressions are more than 4.5 times more complex than a fixed string replacement.

lanzz
  • This may be true, but you didn't provide any supporting information for the statement. That's not really very useful, and doesn't answer the OP's question in a meaningful way. – Todd A. Jacobs Jun 29 '12 at 08:56
  • I don't see how this is helpful, @VitaliPonomar. Why is this a nice answer? Simply because it's almost common sense? – zinking Jun 29 '12 at 10:19
  • @zinking I meant a funny answer. As you can see, I've accepted another answer as the most helpful... – Vitalii Ponomar Jun 29 '12 at 14:54
5

Because a regular expression first has to be compiled into a FSM, which is then used to process the string, whereas replace() can use the underlying string-processing functions that sit closer to the library/OS level.
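
To separate those two costs, here is a quick sketch of the question's function with the pattern compiled once up front, so only the matching work is timed (note that the re module also caches compiled patterns internally, so this mostly matters on the first call; replace_re_precompiled is just an illustrative name):

import re
import time

PATTERN = re.compile(r'(\n|\s{4})')   # compiled once, outside the timed section

def replace_re_precompiled(text):
    # Only the substitution itself is timed here, not pattern compilation.
    start = time.time()
    PATTERN.sub('', text)
    return time.time() - start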

Jon Clements
3

Consider what happens in your source string for every single space character: the regular expression engine must explore the next three characters to see whether they are also spaces. Thus every space essentially involves looking at the characters after it as well. Most RE engines aren't written quite so bluntly, but it does essentially mean that the engine must return to its start state and potentially throw away the partial match it had built up.

However, searching a string for a fixed pattern (four spaces) can actually be done in sub-linear time. I don't know whether Python's replace() is implemented in this fashion or not, but one hopes they've used the tools at their disposal well.
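
A rough sketch of that difference on pure searching: str.find() is implemented in C with an optimized substring search that can skip ahead on mismatches, while re.search() drives the general matching engine. The haystack below is constructed so the only match is near the end (exact numbers will vary by machine):

import re
import timeit

# Long haystack with the only four-space run near the very end,
# so both searches have to look through most of the text.
haystack = 'x' * 500000 + '    done'

find_time = timeit.timeit(lambda: haystack.find('    '), number=1000)
re_time = timeit.timeit(lambda: re.search(' {4}', haystack), number=1000)

print('str.find:  %.4f sec' % find_time)
print('re.search: %.4f sec' % re_time)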

sarnold
3

Rule of Thumb

Fixed-string matches should always be faster than regular expression matches. You can Google for various benchmarks, or do what you did and run your own, but in general you can just assume that this will be true except (possibly) in some unusual edge cases.

Why Regular Expressions are Slower

This is true because fixed-string matches don't involve backtracking, compilation steps, ranges, character classes, or a host of other features that slow down the regular expression engine. There are certainly ways to optimize regex matches, but I think it's unlikely to beat indexing into a string in the common case.
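
One way to see some of that extra machinery is to let the re module dump what it actually compiles the question's pattern into (a small sketch; the exact output format varies across Python versions):

import re

# re.DEBUG makes compile() print the parsed structure of the pattern:
# roughly, a group containing a branch, with a literal newline on one side
# and a counted repeat of a whitespace character class on the other.
re.compile(r'(\n|\s{4})', re.DEBUG)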

Use the Source

If you want a better explanation, you could always look at the source code for the relevant modules to see how they do what they do. That would certainly give you more information about why any particular implementation performs as it does.

Todd A. Jacobs