My Python script reads each line in a file and performs many regex replacements on each line.

If a regex succeeds, I skip to the next line.

Is there any way to speed up this kind of script?
Is it worth calling subn instead, checking whether a replacement was made, and then skipping the remaining regexes?
If I compile the regexes, is it possible to store all the compiled regexes in memory?

for file in files:
    for line in file:
        re.sub(...)  # ~100 re.sub() calls per line

PS: the replacement varies for each regex.
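
What I have in mind with subn would look roughly like this (a sketch; the patterns, replacements, and file name here are placeholders):

import re

# rules: a list of (pattern, replacement) pairs -- placeholders only
rules = [(r"foo", "bar"), (r"\d+", "<num>")]
compiled = [(re.compile(p), r) for p, r in rules]

with open("input.txt") as f:
    for line in f:
        for pattern, repl in compiled:
            line, n = pattern.subn(repl, line)
            if n:       # a replacement was made...
                break   # ...so skip the remaining regexes for this line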

Bear
  • Did you compile the regular expressions? – Ry- Aug 25 '12 at 04:29
  • You should show your code (or a simplified example); it would both make it clearer what you're doing and indicate what can be sped up. – David Robinson Aug 25 '12 at 04:31
  • I've seen some posts saying that Python actually compiles regexes and caches them. By the way, I have nearly 100 regexes; is it really possible to store all the compiled expressions? – Bear Aug 25 '12 at 04:37
  • Python will cache the compiled versions of recently used regular expressions, but only a limited number of them. I think 100 is too many, and you should compile the regular expressions and put them in a list. – kindall Aug 25 '12 at 04:42
  • How about subn? Is there a better method than checking the result of subn for each regex? – Bear Aug 25 '12 at 04:47
  • What size files are you parsing? How many files? You should check out this post on the time complexity of regular expressions: http://stackoverflow.com/questions/4378455/what-is-the-complexity-of-regular-expression. If you are really performing that many regular expressions, then the worst-case time is O(n) for a single string match. Knowing that, you may think about alternative methods of parsing the file, or perhaps threading the application? Also check out this: http://swtch.com/~rsc/regexp/regexp1.html – Peter Kirby Aug 25 '12 at 04:50
  • Do you want to stop as soon as a regex succeeds? – nneonneo Aug 25 '12 at 04:51
  • Stop and go to the next line. – Bear Aug 25 '12 at 04:57
  • Please show some examples. Is the replacement part always the same, or does it vary for each regex? – Tim Pietzcker Aug 25 '12 at 05:54
  • It varies, so I think it is difficult for me to concatenate them. – Bear Aug 25 '12 at 07:10
  • You might be able to look at lexer code, e.g. http://pygments.org/docs/lexerdevelopment/. This is consistent with kindall's idea of putting the regular expressions in a list. – Aaron Newton Aug 25 '12 at 13:29
  • An example of several characteristic regexes might help to provide an answer. How large are the files? How many files are there? – jfs Aug 26 '12 at 02:08

2 Answers

You should probably do three things:

  1. Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
  2. If possible (depending on file size), read the file into memory completely.
  3. Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).

This gives you something like:

import re

regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()  # read the whole file into memory
    result = regex.sub("Replacement", content)
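
If the results need to be written back out, one way to complete the loop (a sketch; it assumes files holds path names rather than open file objects, and that each file fits in memory):

import re

regex = re.compile(r"My big honking regex")
for name in files:
    with open(name) as f:
        content = f.read()      # whole file in memory, per point 2 above
    with open(name, "w") as f:  # rewrite the file in place
        f.write(regex.sub("Replacement", content))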
Tim Pietzcker
  • But if I concatenate all the regexes into one, I can't have a specific replacement for each regex, right? – Bear Aug 25 '12 at 07:07
  • @Bear: It depends on what you're replacing with what. Please show some example of what you're doing. – Tim Pietzcker Aug 25 '12 at 07:43
  • You don't really need to assign `datafile.read()` to `content` since it is used exactly once – using it directly might have a minor performance benefit. You might also be able to save memory by creating a Python iterator for `files` - http://www.ibm.com/developerworks/library/l-pycon/index.html – Aaron Newton Aug 25 '12 at 12:28
  • @AaronNewton: There definitely would be no measurable performance benefit. In this case, I prefer the slightly longer, more explicit version. – Tim Pietzcker Aug 25 '12 at 13:27
  • Sorry, that's true. But what about `[regex.sub("Replacement", datafile.read()) for datafile in files]`? Although many people seem to think that the list comprehension isn't necessarily faster, or useful if you don't want a list http://forums.xkcd.com/viewtopic.php?f=11&t=54443 – Aaron Newton Aug 26 '12 at 00:18
  • @AaronNewton: Using a list comprehension solely for its side effects is definitely discouraged. – Tim Pietzcker Aug 26 '12 at 08:00
  • That's really getting down to a matter of opinion. It really IS a good way of doing things if you know that you want all your results in a list, and there is no need to go into a loop and assign values etc. I would also argue that being concise is also a good side effect, but I really think it's a small point here - let's move on. – Aaron Newton Aug 26 '12 at 10:46

As @Tim Pietzcker said, you could reduce the number of regexes by making them alternatives. You can determine which alternative matched by using the 'lastindex' attribute of the match object.

Here's an example of what you could do:

>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
...     return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'
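
Since the replacement varies for each regex, the same idea extends to a dict of callables keyed by group index (a sketch; the transformations here are only illustrations):

>>> actions = {1: str.lower, 2: str.upper, 3: lambda s: str(int(s) + 1)}
>>> def replace(m):
...     return actions[m.lastindex](m.group())
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'abc DEF 790'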
MRAB