`headingsFix` strips out all the line endings, which you presumably did not intend. However, since your question is about why changing the order of the transformations makes execution slower, I won't discuss fixing that here.
`fixUppercase` is extremely inefficient at handling lines with many words. It calls `line.split()` again and again on the current line, and once `headingsFix` has stripped the newlines, that "line" is the entire book-length string. That isn't terribly slow when each line has maybe a dozen words, but it gets extremely slow when you have one enormous line with tens of thousands of words. I found your program runs vastly faster with this change, which splits each line only once. (Note that I can't say whether your program is correct as it stands, just that this change should have the same behaviour while being a lot faster. I'm afraid I don't particularly understand why it compares each word against the last word on the line.)
```python
def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()  # Split the line once here.
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                    word.isupper()
                    or word.istitle()
                    or word.islower()
                )
            ):
                if word == line_words[-1]:  # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:  # No need to split here.
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:  # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine
    return fixedText
```
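As an aside, the last-word comparison can be avoided entirely: its only apparent purpose is to emit a newline after the final word of each line, and joining the processed words with spaces achieves the same thing. Here's a minimal sketch of that idea (the function name is mine, and it is not byte-for-byte identical to your original — for instance it preserves blank lines, which your version silently drops):

```python
def fixUppercaseJoin(doc):
    fixedLines = []
    for line in doc.split('\n'):
        words = []
        for word in line.split():
            # Lower-case only all-alphabetic words that are not already in a
            # consistent case (all upper, all lower, or title case) -- the same
            # rule as the original if/elif/else chain.
            if word.isalpha() and not (
                word.isupper() or word.istitle() or word.islower()
            ):
                word = word.lower()
            words.append(word)
        fixedLines.append(' '.join(words))
    return '\n'.join(fixedLines)
```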
Here you can see my timings. I downloaded 'Alice in Wonderland' from Project Gutenberg to use as test input.
```
annette@DISSONANCE:~/scratch$ wget 'https://www.gutenberg.org/files/11/11-0.txt' -O alice.txt
--2021-06-13 02:06:33--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174313 (170K) [text/plain]
Saving to: ‘alice.txt’

alice.txt  100%[=========================>]  170.23K   175KB/s  in 1.0s

2021-06-13 02:06:35 (175 KB/s) - ‘alice.txt’ saved [174313/174313]

annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-last < alice.txt > alice1.txt

real    0m0.065s
user    0m0.047s
sys     0m0.016s
annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-first < alice.txt > alice2.txt
^CTraceback (most recent call last):
  File "slow_ocr_cleanup.py", line 117, in <module>
    main()
  File "slow_ocr_cleanup.py", line 106, in main
    doc = fixUppercase(doc)
  File "slow_ocr_cleanup.py", line 17, in fixUppercase
    if word == line.split()[-1]:
KeyboardInterrupt

real    0m16.856s
user    0m8.438s
sys     0m8.375s
annette@DISSONANCE:~/scratch!1$ time python slow_ocr_cleanup.py --fixed < alice.txt > alice3.txt

real    0m0.058s
user    0m0.047s
sys     0m0.000s
```
As you can see, running without the fix was taking a long time so I stopped it early.
Here's the full test program:
```python
import sys


def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        fixedLine = ''
        for word in line.split():
            if (
                word.isalpha()
                and (
                    word.isupper()
                    or word.istitle()
                    or word.islower()
                )
            ):
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line.split()[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine
    return fixedText


def fixUppercaseFast(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                    word.isupper()
                    or word.istitle()
                    or word.islower()
                )
            ):
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine
    return fixedText


def headingsFix(doc):
    fixedText = ''
    count = 0
    stopWords = ['on', 'and', 'of', 'as', 'for']
    for line in doc.split('\n'):
        tokenLine = ''
        for word in line.split():
            if word not in stopWords:
                tokenLine += word + " "
        if tokenLine.istitle() and (
            not line.endswith('.')
            and not line.endswith(',')
            and not line.endswith(')')
            and not line.endswith(';')
            and not line.endswith(':')
        ):
            count += 1
        else:
            fixedText += line
    return fixedText


def main():
    doc = sys.stdin.read()
    if '--headings-last' in sys.argv[1:]:
        doc = fixUppercase(doc)
        doc = headingsFix(doc)
    elif '--headings-first' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercase(doc)
    elif '--fixed' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercaseFast(doc)
    else:
        print('Specify --headings-last, --headings-first or --fixed', file=sys.stderr)
        sys.exit(1)
    print(doc, end='')


if __name__ == '__main__':
    main()
```
You'll note that the string concatenation isn't the source of the problem here, although it's still inadvisable. In some versions of Python there's an optimisation that makes it fast, but in general you can't rely on it always being applied. This question and answer explain the problem in more detail, but broadly speaking, repeatedly using `+` or `+=` to build larger and larger strings in a loop is inefficient because the whole string must be copied each time, and it gets longer and longer as the loop goes on. It's a notorious pitfall known as Schlemiel the Painter's Algorithm. Better alternatives are `str.join` or `io.StringIO`.
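To illustrate the difference, here are the three accumulation patterns side by side (function names are mine, purely for illustration):

```python
import io


def build_slow(words):
    # Quadratic pattern: each += may copy the whole accumulated string.
    out = ''
    for w in words:
        out += w + ' '
    return out


def build_join(words):
    # Linear pattern: collect the pieces, then join them once at the end.
    return ''.join(w + ' ' for w in words)


def build_stringio(words):
    # Equivalent linear pattern using an in-memory text buffer.
    buf = io.StringIO()
    for w in words:
        buf.write(w + ' ')
    return buf.getvalue()
```

All three produce identical output; only their scaling behaviour differs as the result string grows.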