
First off - my code works. It just runs slowly, and I'm wondering if I'm missing something that would make it more efficient. I'm parsing PDFs with Python (and yes, I know that this should be avoided if at all possible).

My problem is that I have to do several rather complex regex substitutions - and when I say substitution, I really mean deletion. I run the expressions that strip out the most data first, so the later ones don't have to analyze as much text, but that's all I can think of to speed things up.

I'm pretty new to python and regexes, so it's very conceivable this could be done better.

Thanks for reading.

    import re

    regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
    regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
    regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
    regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"

    contentRaw = re.sub(regexStartPattern, "", contentRaw)
    contentRaw = re.sub(regexEndPattern, "", contentRaw)
    contentRaw = re.sub(regexPagePattern, "", contentRaw)
    contentRaw = re.sub(regexCleanPattern, "", contentRaw)
gruvn
  • Those regexes don't seem very complex to me. The bigger question is -- how big is `contentRaw`? – ruakh Mar 13 '12 at 13:30
  • Maybe they're not, but they took me a while to come up with. :) 'contentRaw' is typically ~150kb (or ~125,000 characters). – gruvn Mar 13 '12 at 14:08

2 Answers


I'm not sure whether you run this inside a loop. If not, the following does not apply.

If you use a pattern multiple times, you should compile it using re.compile( ... ). That way the pattern is only compiled once, and the speed increase should be huge. Minimal example:

>>> import re
>>> a = "a b c d e f"
>>> re.sub(' ', '-', a)
'a-b-c-d-e-f'
>>> p=re.compile(' ')
>>> re.sub(p, '-', a)
'a-b-c-d-e-f'

Another idea: use re.split( ... ) instead of re.sub and operate on the list of fragments it returns. I'm not entirely sure how it is implemented, but I think re.sub creates text fragments and merges them into one string at the end, which is expensive. After the last step you can join the fragments with "".join(fragments) (since you are replacing matches with the empty string). Obviously, this method will not work if your patterns overlap somewhere.
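As a minimal sketch of that split-and-join idea (the pattern and text here are made-up stand-ins for the question's regexes and data): because every match is replaced with the empty string, joining the surviving fragments with "" reproduces what re.sub would return.

```python
import re

# Hypothetical stand-ins for the question's pattern and data
pattern = re.compile(r"\(continued\)")
contentRaw = "chart 1.2 (continued) chart 3.4 (continued) end"

# re.split drops the matched spans and keeps the surviving fragments
fragments = pattern.split(contentRaw)

# Joining with "" is equivalent to re.sub(pattern, "", contentRaw)
cleaned = "".join(fragments)
print(cleaned)  # chart 1.2  chart 3.4  end
```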

It would be interesting to get timing information for your program before and after your changes.
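For reference, one way to collect such before/after timings is the standard timeit module. The text below is a small synthetic stand-in, not the asker's 150 kB input, and it only exercises the question's page pattern:

```python
import re
import timeit

# Synthetic stand-in for contentRaw, built to match the page pattern
text = "Wk12.345.67 some chart text " * 5000
pattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
compiled = re.compile(pattern)

# Time the string-pattern call against the precompiled call
t_plain = timeit.timeit(lambda: re.sub(pattern, "", text), number=20)
t_compiled = timeit.timeit(lambda: compiled.sub("", text), number=20)
print(f"string pattern: {t_plain:.3f}s  precompiled: {t_compiled:.3f}s")
```

Both calls produce identical output; any difference you measure is the per-call cost of looking up or recompiling the pattern.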

hochl
  • In the OP's sample code, the re's are only used once each, so adding compilation is probably not too big a win. And the re module does caching of the compiled expressions internally, so I'm not sure there will be a huge speed increase in any case. – PaulMcG Mar 13 '12 at 14:14
  • Yes, this has been discussed on SO a [lot](http://stackoverflow.com/q/452104/589206). The question is whether you can/want to rely on implementation details of the Python library, and how often you use the pattern. – hochl Mar 13 '12 at 14:20

Regexes are always the last choice when trying to decode strings, so if you see another possibility to solve your problem, use that.

That said, you could use re.compile to precompile your regex patterns:

regexPagePattern = re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")
contentRaw = regexPagePattern.sub("", contentRaw)

That should speed things up a bit (a pretty nice bit ;) )

marue
  • That's right: the first choice when decoding strings should be the `decode` function. :) – tchrist Mar 13 '12 at 13:40
  • 1m23s new processing time vs. 1m24s old processing time, so we're not saving a ton of time, but we're coding better. Interestingly, the first regex run `r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"` takes about 1m15s, while the subsequent ones are almost instantaneous. – gruvn Mar 13 '12 at 14:39