
I've switched to Python fairly recently and I want to clean up a very large number of web pages (around 12k; they can just as easily be treated as plain text files) by removing certain tags and other string patterns. For this I'm using Python's re.sub(...) function.

My question is whether it's better (from an efficiency point of view) to build one big regular expression that matches several of my patterns, or to call the function several times with smaller, simpler regular expressions.

For example, is it better to use something like

 re.sub(r"<[^<>]*>", content)
 re.sub(r"some_other_pattern", content)

or

 re.sub(r"<[^<>]*>|some_other_pattern",content)

Of course, for the sake of the example the previous patterns are really simple and I haven't compiled them here, but in my real-life scenario I will.
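
In case it helps, here's a rough sketch of how I plan to time the two approaches once the patterns are compiled (the sample text and repetition count are just placeholders):

    import re
    import timeit

    # Placeholder content; a real page would be needed for meaningful numbers.
    content = "<p>some text</p> some_other_pattern more text " * 1000

    # Approach 1: several small compiled patterns applied one after another.
    small_patterns = [re.compile(r"<[^<>]*>"), re.compile(r"some_other_pattern")]

    def clean_sequential(text):
        for pattern in small_patterns:
            text = pattern.sub("", text)
        return text

    # Approach 2: one combined pattern using alternation.
    combined = re.compile(r"<[^<>]*>|some_other_pattern")

    def clean_combined(text):
        return combined.sub("", text)

    print(timeit.timeit(lambda: clean_sequential(content), number=100))
    print(timeit.timeit(lambda: clean_combined(content), number=100))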

Later edit: The question is not about the HTML nature of the files, but about Python's behavior when dealing with multiple regex patterns.

Thanks!

Cosmin SD
  • [Obligatory warning about parsing HTML with regexes](http://stackoverflow.com/a/1732454/950912) – brc Sep 23 '12 at 23:42
  • Actually, as I said, it's mainly not about removing and parsing HTML text but about removing some particular non-HTML-related patterns. My question can also be put more generally: about simple text files and replacing a bunch of patterns in them. – Cosmin SD Sep 23 '12 at 23:44
  • I think it comes down to how good you are with regex... if you can do it with one, then use one... I would probably break it into several just so it's easier for a human to parse... – Joran Beasley Sep 23 '12 at 23:49
  • 2
    I think you can profile different approaches to reach your own conclusion. – swang Sep 24 '12 at 00:03

3 Answers


Keep it simple.

I would say that you are safer using smaller regexes to work through this. At least that way, if something behaves abnormally, you don't have to go digging through a massive regex to find which particular section is misbehaving. Provided you have good logging of the replacements you make, it should be trivial to track down the source of a problem, should one arise.
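
For example, a small sketch of that idea (the pattern list and log messages are only placeholders):

    import logging
    import re

    logging.basicConfig(level=logging.INFO)

    # A hypothetical list of small, independently testable patterns.
    patterns = [
        ("html tag", re.compile(r"<[^<>]*>")),
        ("other junk", re.compile(r"some_other_pattern")),
    ]

    def clean(text):
        for name, pattern in patterns:
            text, count = pattern.subn("", text)
            logging.info("pattern %r made %d replacements", name, count)
        return text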

You don't want to run into this

Tadgh

Speaking generally, "sequential" and "parallel" application are not the same and may produce different results, because sequential replacements can affect each other.
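
A contrived example of that difference (the patterns are chosen only to show the effect):

    import re

    text = "abc"

    # Sequential: the second pattern sees the output of the first.
    step1 = re.sub(r"a", "b", text)        # "bbc"
    step2 = re.sub(r"b+", "X", step1)      # "Xc"

    # Combined alternation: every match is decided against the original text.
    combined = re.sub(r"a|b+", "X", text)  # "XXc"

    print(step2, combined)  # Xc XXc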

As to performance, I would guess that one expression will perform better, but that's just a guess. I personally prefer to keep them complex and use "verbose" mode for readability's sake.
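
For instance, re.VERBOSE lets you annotate a combined pattern (reusing the patterns from the question; the sample content is a placeholder):

    import re

    content = "<p>some text</p> some_other_pattern more text"

    cleanup = re.compile(r"""
        <[^<>]*>              # anything that looks like a tag
        | some_other_pattern  # whatever else should be stripped
        """, re.VERBOSE)

    print(cleanup.sub("", content))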

georg

I understand your additional comment that it's the non-HTML parts you're cleaning up. Because a later RE could find and replace content that an earlier RE has already changed, you'd be better off using the alternation operator and a single RE.

Also, consider using BeautifulSoup to load and examine your HTML files. This will help you find the appropriate parts of your text with far less risk of capturing some HTML construct when you only intended to replace some text.
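
A minimal sketch of that approach (assuming the bs4 package is installed; the file name is just a placeholder):

    import re
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as f:  # hypothetical file name
        soup = BeautifulSoup(f, "html.parser")

    # Operate on the text content only, so no HTML construct can be captured.
    text = soup.get_text()
    cleaned = re.sub(r"some_other_pattern", "", text)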

Chris Cogdon