0

While working on a script to correct formatting errors from documents produced by OCR, I ran into an issue where depending on which loop I run first, the program runs about 80% slower.

Here is a simplified version of the code. I have the following loops to check for uppercase errors (e.g. "posSible"):

def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        fixedLine = ''
        for word in line.split():
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' ' 
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line.split()[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' ' 
        fixedText += fixedLine  
    
    return fixedText

The following loop checks for and removes headings:

def headingsFix(doc):
    fixedText = ''
    count = 0
    stopWords = ['on', 'and', 'of', 'as', 'for']
    for line in doc.split('\n'):
        tokenLine = ''
        for word in line.split():
            if word not in stopWords:
                tokenLine += word + " "
        if tokenLine.istitle() and (
            not line.endswith('.') 
            and not line.endswith(',')
            and not line.endswith(')')
            and not line.endswith(';')
            and not line.endswith(':')
        ):

            count += 1
        else:
            fixedText += line
            
    return fixedText

It's the loop in the fixedUppercase function that massively slows down. If I run any other function or loop prior to that one or If I run that one first or remove it entirely, the program is quick. Same behavior if both loops are part of one function.

I thought maybe another function or loop was causing the error by expanding the length of the document, but a check with len() shows same doc size either way.

Will Da Silva
  • 6,386
  • 2
  • 27
  • 52
  • Could you be a bit more explicit about what you mean by "...massively slows down if I run any other function or loop prior to that one"? Could you perhaps give a specific example of running another function first that makes that one measurably slower? – Weeble Jun 13 '21 at 00:26
  • You'll find some helpful guidelines for creating a minimal reproducible example here: https://stackoverflow.com/help/minimal-reproducible-example To give you some idea of the problems facing anyone trying to answer your question: 1. we don't know your input - does it have long lines? many lines? many words? 2. we don't know how to run your code to reproduce your problem - what other function should we run prior, and how should we do so, which inputs and outputs go where? – Weeble Jun 13 '21 at 00:35
  • Hi, I did give an example: if I run headingsFix() before fixUppercase(), the latter takes about 80% longer to complete. P.S. I know I could accomplish the same with a regular expression, I'm just curious as to why the speed diff. – theotechne Jun 13 '21 at 00:35
  • Hi Weeble, the input is string. You can try it with any string. Obv. the smaller the string the smaller the time and diff. With 300k words, it's several minutes. Just pass a string to one function then another, time it, do it again with order reversed. – theotechne Jun 13 '21 at 00:40
  • without running it, this is probably slow because it does a lot of string building, which is very slow in Python because strings are immutable and so it must create a new one each time! you can use [timeit](https://docs.python.org/3/library/timeit.html) to compare the speeds of functions in Python! – ti7 Jun 13 '21 at 00:43
  • Your `headingsFix` seems to remove all line breaks from the text. If a function like that is run before `fixUppercase` it might slow it down a lot because there are a lot of (mostly) unnecessary calls to `line.split` and this could take a lot of time if the document is a single very long line. – Stuart Jun 13 '21 at 00:44
  • Make sure your other functions are not removing line breaks, and restructure `fixUppercase` so it doesn't repeatedly split the same line just to check if each word is the last one in the line. Append words to a list instead and do `fixedText += fixedLine.join(" ") + "\n"` to add each line to the text – Stuart Jun 13 '21 at 00:49
  • There is a serious logic error in your `fixedUpper` - you check if a word is the last word on the line, but if a line would read `hello, I just wanted to say hello`, it will think it has reached the end of the line at the first word - apart from that error, it's highly inefficient. – Grismar Jun 13 '21 at 00:50
  • Thanks, Stuart, you were right that it was the removal of line breaks! – theotechne Jun 13 '21 at 00:58
  • Grismar, thanks for pointing this out. It would have been a very rare occurrence given some other parts in my pipeline, and I'm just using a regex here, but you're correct nevertheless. (But your example wouldn't produce the error since it is 'hello,'.) – theotechne Jun 13 '21 at 01:03
  • Since you're concerned about run time, consider using `io.StringIO` instead of `str` for the values to which you're appending `str`S – Michael Ruth Jun 13 '21 at 02:50

2 Answers2

2

headingsFix strips out all the line endings, which you presumably did not intend. However, your question is about why changing the order of transformations results in slower execution, so I'll not discuss fixing that here.

fixUppercase is extremely inefficient at handling lines with many words. It repeatedly calls line.split() over and over again on the entire book-length string. That isn't terribly slow if each line has maybe a dozen words, but it gets extremely slow if you have one enormous line with tens of thousands of words. I found your program runs vastly faster with this change to only split each line once. (I note that I can't say whether your program is correct as it stands, just that this change should have the same behaviour while being a lot faster. I'm afraid I don't particularly understand why it's comparing each word to see if it's the same as the last word on the line.)

def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()   # Split the line once here.
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText

Here you can see my timings. I download 'Alice in Wonderland' from Project Gutenberg to use as test input.

annette@DISSONANCE:~/scratch$ wget 'https://www.gutenberg.org/files/11/11-0.txt' -O alice.txt
--2021-06-13 02:06:33--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174313 (170K) [text/plain]
Saving to: ‘alice.txt’

alice.txt                                               100%[============================================================================================================================>] 170.23K   175KB/s    in 1.0s

2021-06-13 02:06:35 (175 KB/s) - ‘alice.txt’ saved [174313/174313]

annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-last < alice.txt > alice1.txt

real    0m0.065s
user    0m0.047s
sys     0m0.016s
annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-first < alice.txt > alice2.txt
^CTraceback (most recent call last):
  File "slow_ocr_cleanup.py", line 117, in <module>
    main()
  File "slow_ocr_cleanup.py", line 106, in main
    doc = fixUppercase(doc)
  File "slow_ocr_cleanup.py", line 17, in fixUppercase
    if word == line.split()[-1]:
KeyboardInterrupt

real    0m16.856s
user    0m8.438s
sys     0m8.375s
annette@DISSONANCE:~/scratch!1$ time python slow_ocr_cleanup.py --fixed < alice.txt > alice3.txt

real    0m0.058s
user    0m0.047s
sys     0m0.000s

As you can see, running without the fix was taking a long time so I stopped it early.

Here's the full test program:

import sys


def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        fixedLine = ''
        for word in line.split():
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line.split()[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText


def fixUppercaseFast(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText


def headingsFix(doc):
    fixedText = ''
    count = 0
    stopWords = ['on', 'and', 'of', 'as', 'for']
    for line in doc.split('\n'):
        tokenLine = ''
        for word in line.split():
            if word not in stopWords:
                tokenLine += word + " "
        if tokenLine.istitle() and (
            not line.endswith('.')
            and not line.endswith(',')
            and not line.endswith(')')
            and not line.endswith(';')
            and not line.endswith(':')
        ):

            count += 1
        else:
            fixedText += line

    return fixedText


def main():
    doc = sys.stdin.read()
    if '--headings-last' in sys.argv[1:]:
        doc = fixUppercase(doc)
        doc = headingsFix(doc)
    elif '--headings-first' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercase(doc)
    elif '--fixed' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercaseFast(doc)
    else:
        print('Specify --headings-last, --headings-first or --fixed', file=sys.stderr)
        sys.exit(1)
    print(doc, end='')


if __name__ == '__main__':
    main()

You'll note that the string concatenation isn't the source of the problem, although it's still inadvisable. In some some versions of Python there's an optimisation that makes it fast, but in general you can't rely on it always to work. This question and answer explain the problem in more detail, but broadly speaking, repeatedly using + or += to build larger and larger strings in a loop is inefficient because every time the whole string needs to be copied, and it's getting longer and longer as the loop goes on. It's a notorious pitfall known as Schlemiel the Painter's Algorithm. Better alternatives are to use str.join or io.StringIO.

Weeble
  • 17,058
  • 3
  • 60
  • 75
  • Thanks for the suggestion. It never occurred to me that every iteration would call the method again if it was in the header. (I would give you an upvote, but my rep is too low.) Q: how does this relate to the walrus operator? It's not reinstantiating the var on each iteration? – theotechne Jun 13 '21 at 01:24
  • @theotechne The walrus operator is a name for the assignment expression introduced in Python 3.8 using `:=` and I'm afraid I don't know how it relates to your question. I am afraid I don't really understand what you mean here when you say "in the header" or "reinstantiating the var on each iteration"... are you asking about splitting the words in a line, or about string concatenation, or something else? – Weeble Jun 13 '21 at 01:32
  • Oh, wait, you were referring to the 'line.split()'s in the conditionals (if/elif/else)? Sorry, I guess that's another reason to not write the code like that. I lost track of those and thought you were referring to assigning line.split() to a var in first block instead of the second block loop header. (Also thanks for showing me what a good reproducible example looks like - still learning.) – theotechne Jun 13 '21 at 01:40
  • Ah, sorry I wasn't clear. Yes, the calls to `line.split()` inside the body of the loop are the ones that are being called over and over. – Weeble Jun 13 '21 at 01:44
  • 1
    I upvoted your answer but: "In this case CPython has an optimisation it can apply when you're appending to a string" seems to be discouraged per [https://stackoverflow.com/questions/35787022/cython-string-concatenation-is-super-slow-what-else-does-it-do-poorly/35791033#35791033](https://stackoverflow.com/questions/35787022/cython-string-concatenation-is-super-slow-what-else-does-it-do-poorly/35791033#35791033) – DarrylG Jun 13 '21 at 02:07
  • @DarrylG yes, I'm pretty sure this optimization has been removed in recent versions of CPython so that people stop writing inefficient code. – juanpa.arrivillaga Jun 13 '21 at 03:16
  • Thanks @DarrylG, I've added that link in the answer and made it a bit more forceful in discouraging looped concatenation. – Weeble Jun 13 '21 at 23:20
  • @juanpa.arrivillaga Do you have a link for that? As far as I can tell it's still quick in Python 3.9 and it looks like there's still code in place for the optimisation in CPython: https://github.com/python/cpython/blob/bf527277d4e4907e32d76ca7ba667ab3149fe258/Objects/unicodeobject.c#L11492-L11497 – Weeble Jun 13 '21 at 23:38
  • @Weeble I could be totally wrong, or maybe, I only saw a proposal for it to be removed. – juanpa.arrivillaga Jun 14 '21 at 01:50
1

Your fixUppercase() function basically does this:

  • change all alphabetical words that are not all lowercase, a proper title or all uppercase to all lowercase

However, you assume a document would only contain \n and as whitespace, so tabs (for example) would break your code. You could instead break up the document in space metacharacters and strings of the rest using regular expressions.

Your main problem is caused by the inefficiency of fixedUpper, so a solution would be to fix that.

This would do the same, but more efficiently:

import re

example="""
This is an example.

It Has:\ta fEw examples of thIngs that should be FIXED and CHANGED!

Don't touch this: a123B or this_Is_finE

Did it woRk?
"""


def fixedUpper(doc):
    p = re.compile(r'\s|([^\s]+)')
    # go through all the matches and join them back together into a string when done
    return ''.join(
        # lowercase for any alphabetic substring that does not contain whitespace and isn't a title or all uppercase
        m.group(1).lower() if not (m.group(1) is None or m.group(1).istitle() or m.group(1).isupper()) and m.group(1).isalpha()
        # in all other cases, just leave the match untouched
        else m.group(0)
        for m in p.finditer(doc)
    )


print(repr(fixedUpper(example)))

Output (note how it preserved the whitespace):

"\nThis is an example.\n\nIt Has:\ta few examples of things that should be FIXED and CHANGED!\n\nDon't touch this: a123B or this_Is_finE\n\nDid it woRk?\n"

Also note that this still has the problem your code does as well: if there's interpunction at the end of a word, it's not fixed, like woRk?

This is better:

def fixedUpper(doc):
    p = re.compile(r'\s|((\w+)([^\w\s]*))')
    return ''.join(
        m.group(1).lower()
        if not (m.group(2) is None or m.group(2).istitle() or m.group(2).isupper()) and m.group(2).isalpha()
        else m.group(0)
        for m in p.finditer(doc)
    )
Grismar
  • 27,561
  • 4
  • 31
  • 54
  • Thanks for this regex example. It's more robust than what I came up with. (Some of the loop logic issue you mention don't apply with my specific project, but important to keep in mind for broader applications, so I appreciate it.) – theotechne Jun 13 '21 at 01:44