First off: my code works. It just runs slowly, and I'm wondering if I'm missing something that would make it more efficient. I'm parsing PDFs with Python (and yes, I know this should be avoided if at all possible).
My problem is that I have to do several rather complex regex substitutions, and when I say substitution, I really mean deletion. I run the ones that strip out the most data first so that the later expressions have less text to scan, but that's all I can think of to speed things up.
I'm pretty new to Python and regexes, so it's very conceivable that this could be done better.
Thanks for reading.
import re

# Patterns for the text to delete (contentRaw holds the text extracted from the PDF).
regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"

# Delete in order, starting with the patterns that remove the most text.
contentRaw = re.sub(regexStartPattern, "", contentRaw)
contentRaw = re.sub(regexEndPattern, "", contentRaw)
contentRaw = re.sub(regexPagePattern, "", contentRaw)
contentRaw = re.sub(regexCleanPattern, "", contentRaw)
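
To find out which of the four deletions actually costs the most time, each one could be timed on its own. The following is only a minimal sketch under a few assumptions: the names `PATTERNS` and `strip_with_timings` are hypothetical helpers that are not part of my original code, and the patterns are copied unchanged and merely compiled once up front with `re.compile`.

```python
import re
import time

# Same four patterns as above, compiled once and kept in the original order.
# PATTERNS and strip_with_timings are hypothetical names used for illustration.
PATTERNS = [
    ("start", re.compile(r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)")),
    ("end", re.compile(r"(II.)\d{1,5}\((P|T)\).*")),
    ("page", re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")),
    ("clean", re.compile(r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})")),
]


def strip_with_timings(text):
    """Apply each deletion in order; return the cleaned text and seconds spent per pattern."""
    timings = {}
    for name, pattern in PATTERNS:
        started = time.perf_counter()
        text = pattern.sub("", text)  # deleting means substituting the empty string
        timings[name] = time.perf_counter() - started
    return text, timings
```

Calling `cleaned, timings = strip_with_timings(contentRaw)` and printing `timings` should show which pattern dominates the run time, which is probably the one worth rewriting first.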