How to compile multiple regexes in one go? Is it more efficient? - python

Question

Let's say I have code like this:

import re
docid_re = re.compile(r'<DOCID>([^>]+)</DOCID>')
doctype_re = re.compile(r'<DOCTYPE SOURCE="[^"]+">([^>]+)</DOCTYPE>')
datetime_re = re.compile(r'<DATETIME>([^>]+)</DATETIME>')

I could also do this:

>>> import re
>>> docid_re = r'<DOCID>([^>]+)</DOCID>'
>>> doctype_re = r'<DOCTYPE SOURCE="[^"]+">([^>]+)</DOCTYPE>'
>>> datetime_re = r'<DATETIME>([^>]+)</DATETIME>'
>>> docid_re, doctype_re, datetime_re = map(re.compile, [docid_re, doctype_re, datetime_re])
>>> docid_re
<_sre.SRE_Pattern object at 0x7f0314eee438>

But is there any real gain in speed or memory from using map()?

Are you parsing XML with regex? – Rafael Barros Sep 30 '15 at 18:40
Have you tried measuring it? – Morgan Thrapp Sep 30 '15 at 18:46

score 1 · Answer 1 · answered Sep 30 '15 at 20:08

If you were compiling a lot of regexes, map might help by avoiding the cost of looking up re and then fetching its compile attribute on every call; with map, you look up map once and re.compile once, and the function is then reused without further lookups. Of course, when you need to construct a list just to feed map, you eat into those savings. Practically speaking, you'd need an awful lot of regexes to reach the point where map would be worth your while; for three, it's probably a loss.

Even when it did help, it would be the tiniest of microoptimizations. I would do it if it made the code cleaner; performance is a tertiary concern here at best. There are cases (say, parsing a huge text file of integers into ints) where map can be a big win because the overhead of starting it up is compensated for by the reduced lookup and Python bytecode execution overhead. But this is not one of those cases, and those cases are so rare as to not be worth worrying about 99.99% of the time.
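If the goal is cleaner code rather than speed, one way to compile everything in one go is a dict comprehension keyed by name. This is only a sketch: the names `PATTERNS` and `COMPILED` and the sample string are made up for illustration, and the patterns are the ones from the question.

```python
import re

# Raw pattern strings from the question, keyed by illustrative names.
PATTERNS = {
    "docid": r'<DOCID>([^>]+)</DOCID>',
    "doctype": r'<DOCTYPE SOURCE="[^"]+">([^>]+)</DOCTYPE>',
    "datetime": r'<DATETIME>([^>]+)</DATETIME>',
}

# Compile every pattern in one expression; lookup by name stays readable.
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}

# The map() form from the question does the same job for positional unpacking.
docid_re, doctype_re, datetime_re = map(re.compile, PATTERNS.values())

sample = '<DOCID>doc-001</DOCID>'  # hypothetical input
match = COMPILED["docid"].search(sample)
docid = match.group(1) if match else None
```

Either form is fine for three patterns; the dict just scales more comfortably as the list of patterns grows.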

score 1 · Accepted Answer · answered Sep 30 '15 at 20:44

Don't listen to anybody, just measure it! You can use the timeit module for that. But remember that "premature optimization is the root of all evil" (c) Donald Knuth.

Btw, the answer to your question is "No, it doesn't help at all".
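A rough sketch of how such a measurement could look with `timeit`. The helper names (`compile_separately`, `compile_with_map`, `best_time`) are invented for this example. Because the `re` module caches compiled patterns, the sketch assumes you want to time real compilation, so it clears the cache with `re.purge()` in the setup step and runs the statement once per repetition.

```python
import re
import timeit

PATTERNS = [
    r'<DOCID>([^>]+)</DOCID>',
    r'<DOCTYPE SOURCE="[^"]+">([^>]+)</DOCTYPE>',
    r'<DATETIME>([^>]+)</DATETIME>',
]

def compile_separately():
    # Three explicit calls, as in the first snippet of the question.
    return (
        re.compile(PATTERNS[0]),
        re.compile(PATTERNS[1]),
        re.compile(PATTERNS[2]),
    )

def compile_with_map():
    # One map() call, as in the second snippet of the question.
    return tuple(map(re.compile, PATTERNS))

def best_time(func, repeat=200):
    # setup runs once before each repetition, so every timed call starts
    # with an empty regex cache; number=1 keeps the cache from warming up.
    timer = timeit.Timer(func, setup=re.purge)
    return min(timer.repeat(repeat=repeat, number=1))

if __name__ == "__main__":
    print("separate:", best_time(compile_separately))
    print("map:     ", best_time(compile_with_map))
```

The absolute numbers depend on the machine and Python version, so compare the two results against each other rather than reading much into either one.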

answered by Jimilian

Actually, the `timeit` module is going to deceive you in this case. [`re.compile` caches the compiled forms of the regexes it compiles](https://docs.python.org/3/library/re.html#re.compile), so you might see the perf improvement as a substantial benefit relative to total cost, when in real code the compiled regex won't already be in the cache and the savings are microscopic compared to the expense of compilation in the first place. You'd need to explicitly `re.purge()` the cache on every loop (but of course, that adds a different sort of confounding overhead). – ShadowRanger Sep 30 '15 at 21:01
@ShadowRanger, yes, you are right. Also, in a real application the regex will be compiled and cached after its first use anyway, without any `re.compile` call. – Jimilian Oct 01 '15 at 07:22
@Jimilian, and how is the regex compiled and cached without running `re.compile`? – alvas Oct 02 '15 at 11:49

@alvas, automatically - ShadowRanger already put a [link](https://docs.python.org/2/library/re.html#re.compile) to the official documentation about it. You can read about it in the `Note` section. – Jimilian Oct 02 '15 at 14:55
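A small sketch of the caching described in that note, as it can be observed on CPython. The identity checks rely on how CPython's `re` cache currently behaves, which is an implementation detail and not a documented guarantee, so treat this as an illustration only. The sample string is made up.

```python
import re

PATTERN = r'<DOCID>([^>]+)</DOCID>'

first = re.compile(PATTERN)
second = re.compile(PATTERN)    # answered from the module's internal cache
same_object = first is second   # assumption: CPython returns the cached object

# Module-level functions go through the same cache, so this works
# without any explicit re.compile call.
found = re.search(PATTERN, '<DOCID>doc-001</DOCID>')

re.purge()                      # empty the cache
third = re.compile(PATTERN)     # compiled again from scratch
recompiled = third is not first
```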

The main difference between explicit compiles and implicit compiles is that the explicit compile means you get a `regex` object back that can be used directly; if the object is local or on a class instance, it's cheaper to look up than `re` imported into the global namespace (thanks to [`LEGB`](https://blog.mozilla.org/webdev/2011/01/31/python-scoping-understanding-legb/) search). When you use the module-level functions, they still have to look up the pattern in the cache to find the `regex`, and the `regex` may have aged out if you used a bunch of other patterns in the meantime. Explicit `compile` avoids that. – ShadowRanger Oct 02 '15 at 18:50
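To make the two styles concrete, here is a sketch with both versions side by side. The function names and sample lines are hypothetical; binding the compiled method as a default argument is just one way to get the local lookup mentioned above, not the only one.

```python
import re

DATETIME_PATTERN = r'<DATETIME>([^>]+)</DATETIME>'

def extract_implicit(lines):
    # Every call resolves the global `re`, then `re.search`, and then
    # finds the compiled pattern in the module's cache.
    matches = (re.search(DATETIME_PATTERN, line) for line in lines)
    return [m.group(1) for m in matches if m]

def extract_explicit(lines, _search=re.compile(DATETIME_PATTERN).search):
    # The bound method is compiled once at definition time and is a
    # local name inside the function, so no cache lookup is involved.
    matches = (_search(line) for line in lines)
    return [m.group(1) for m in matches if m]

lines = [
    '<DATETIME>2015-09-30T18:40:00</DATETIME>',  # made-up sample data
    'no timestamp on this line',
]
```

Both functions return the same result; the difference is only in how much lookup work happens per call.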

Frankly, my favorite thing about compiling a regex is that it lets you use many `regex` methods with one argument, which means it can be easily used w/functional methods like `map`, `filter` (and similar methods on `multiprocessing.Pool`); `map` and `filter` are normally slower than an equivalent list comprehension (Py2) or generator expression (Py3), but for largish iterables, when the mapping function/filter predicate is implemented in `C` (`regex` methods mostly are), `map` and `filter` are faster as they avoid lookup costs and push all execution to the C layer (in CPython only of course). – ShadowRanger Oct 02 '15 at 18:55
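A sketch of that one-argument usage, with made-up sample lines: the bound `search` method of a compiled pattern is passed straight to `filter` and `map`.

```python
import re

docid_re = re.compile(r'<DOCID>([^>]+)</DOCID>')

lines = [
    '<DOCID>a1</DOCID>',   # hypothetical input lines
    'no tag here',
    '<DOCID>b2</DOCID>',
]

# filter() keeps the lines where search() returns a match object (truthy).
lines_with_docid = list(filter(docid_re.search, lines))

# map() applies search() to every line; lines without a match yield None.
docids = [m.group(1) for m in map(docid_re.search, lines) if m]
```

With the module-level `re.search` you would need a `lambda` or `functools.partial` to fix the pattern argument first.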

@Jimilian: For the record, I do agree that premature optimization is the root of all evil. The only reason I'm geeking out is that this is one of those cases where I have previously used `timeit` (via `ipython`'s `%timeit` magic) to investigate the performance of stuff like this, so I have enough experience to say "Yes, it can go faster, but no, the speedup will never be meaningful". – ShadowRanger Oct 02 '15 at 18:59
