5

I have a Python script which does a big parsing job on a Word .docx file (and it has worked on all previous files). It suddenly stops working mid-run without throwing any exceptions. The entire code is wrapped inside a try-except-clause, basically like this:

try:
    report.read_and_postprocess() #: My parsing process (with ALOT of code inside)
    print("Finished...")
except NoObjectForElementError as error:
    print("Known error")
    raise error
except:
    print("Other error")
    raise
finally:
    print("Write this!")

To locate the error, I have tried using print() statements to isolate where the code stops. However, when I get close, the stop point moves to a different place.

This makes it seem to me that there is a memory leak, or perhaps some other limitation to my Python environment which is kicking in - but I'm not very skilled in debugging this. I looked at the memory and CPU usage of my script. In the end of the script it only consumes about 87MB memory, making the computer's total memory move only from 67% to 68%. (I only look at the Windows 10 task manager for this, so I can't guarantee that memory doesn't momentarily spike before the script stops)

I have rerun the script hundreds of times, and never got an error message, except twice with this error:

Python Fatal Error: GC Object already Tracked

I have no clue how to continue debugging this, does someone have any tips? Is it likely caused by memory, or how can I find out? Is there some other Python limitation that could cause this? (I have read about a 20-limit of nested for-loops, for instance - but that should not be the case for me)

Update: It stops during the report.read_and_postprocess() line.

Info about environment: Windows 10, Anaconda 3, Python 3.7. Script runs in the anaconda prompt, with my environment activated.

Update: I found this tip: https://stackoverflow.com/a/3443779/6700475 It may seem like my code is spending large amounts of time evaluating regular expressions. Basically lines like these were common while watching the trace:

sre_compile.py(596):     return isinstance(obj, (str, bytes))
sre_compile.py(763):         pattern = p
sre_compile.py(764):         p = sre_parse.parse(p, flags)
 --- modulename: sre_parse, funcname: parse
sre_parse.py(922):     source = Tokenizer(str)
 --- modulename: sre_parse, funcname: __init__
sre_parse.py(225):         self.istext = isinstance(string, str)
sre_parse.py(226):         self.string = string
sre_parse.py(227):         if not self.istext:
sre_parse.py(229):         self.decoded_string = string
sre_parse.py(230):         self.index = 0
sre_parse.py(231):         self.next = None
sre_parse.py(232):         self.__next()
 --- modulename: sre_parse, funcname: __next
sre_parse.py(234):         index = self.index
sre_parse.py(235):         try:
sre_parse.py(236):             char = self.decoded_string[index]
sre_parse.py(240):         if char == "\\":
sre_parse.py(247):         self.index = index + 1
sre_parse.py(248):         self.next = char
sre_parse.py(924):     if pattern is None:
sre_parse.py(925):         pattern = Pattern()
 --- modulename: sre_parse, funcname: __init__
sre_parse.py(77):         self.flags = 0
sre_parse.py(78):         self.groupdict = {}

I ran the trace and (at least this time) it has stopped during a re.match, specifically this one - on the 3rd iteration of the loop:

def is_numeric(text):
    """ Return whether a trimmed string is numeric 

    Numeric formats:
        1
        1.2 (US style)
        1,2 (EU style)
        1,200,340 (US style)
        1 200 340 (other style)
        1.200.340 (eu style?)
        1,200,340.67
        1 200 340,67
        1 200 340.67
        1.200.340,67
        -23
        -23.98
        -2 454 981.21
        + 24
        - 24
        + 24.9183
        12321
        543525,-
        123123,
        12389.
        12 489.
        12 432,
    """
    if len(text) == 0:
        return False

    #: Try float
    try:
        float(text)
    except ValueError:
        pass
    except:
        raise
    else:
        return True

    #: look for all characters that should not be in a number
    if not re.match(r"^[-+0-9., ]*$", text):
        return False

    #: Verify numeric format
    #:  1.200.200,78
    #:  1,200,200.78
    #:  1 200 200,78
    #:  1 200 200.78
    #:  1200200,78
    #:  1200200.78
    #:  - 1200200.78
    #:  + 1.200.200,78
    #:  1200200,-
    #:  -1 200 200,-
    #:  etc.
    variants = ((r",", r"\."),
                (r"\.", r","),
                (r" ", r","),
                (r" ", r"\."))
    for (tho, dec) in variants:
        dec_exp_opt = r"(%s[0-9]*)" % dec
        if dec == ",":
            dec_exp_opt = r"((%s)|(,-))" % dec_exp_opt
        threesep = r"[1-9][0-9]{0,2}(%s[0-9]{3})*(%s)?" % (tho, dec_exp_opt)
        nullsep = r"(([0-9]*(%s[0-9]+)?)|([0-9]+(%s)?))" % (dec, dec_exp_opt)
        expr = r"^([-+][ \t]*)?((%s)|(%s))$" % (threesep, nullsep)

        test = re.match(expr, text) #: IT HAS STOPPED HERE IN ITERATION 3!!
        if test:
            return True

    return False

It may (or may not) be random that the expression it is trying to solve at this time is: ^([-+][ \t]*)?(([1-9][0-9]{0,2}( [0-9]{3})*((((,[0-9]*))|(,-)))?)|((([0-9]*(,[0-9]+)?)|([0-9]+((((,[0-9]*))|(,-)))?))))$ for the value 2017-05-29 (which should preferably return false).

Does it make sense that (poor?) regular expressions can cause the script to just stop without raising an exception? Is there some caching in the re module that could cause this?

Andreas O
  • 239
  • 1
  • 4
  • 15
  • How exactly you're running this script? I mean there is no main method? Are you sure that your try block is getting executed? – Birlla Feb 07 '19 at 11:19
  • Yeah, I'm running it as part of a larger script in Anaconda prompt. It definitely executes the try-block, since adding an assertion inside the script (before it crashes) is caught. – Andreas O Feb 07 '19 at 12:22
  • Can you make sure you are not hanging the regex engine with catastrophic backtracking https://www.regular-expressions.info/catastrophic.html – Mihai Andrei Feb 07 '19 at 17:49
  • i rand your code on the test string you have in the comments without any issues. Does your python process just die? In particular the regex that you mentioned as failing is not breaking my python. – Mihai Andrei Feb 07 '19 at 17:53
  • It does not break my Python either with just one try. So if the regex is the issue, it might be because of some accumulating effect. In this case, this regex is one of many regex'. I've tried to collect all the regex' run by that exact function, and got to 74 before it fails. I will look into the "runaway regular expressions". Thanks! – Andreas O Feb 07 '19 at 22:19
  • I have managed to locate the method which is at fault. When commenting it out, the entire script completed. The method typechecks contents of a table cell, using many regular expressions and dateutil parser. I will update with more information once I have solved the problem. In general, python's `trace` was a good way of debugging this. – Andreas O Feb 08 '19 at 15:36

2 Answers2

9

The issue was solved eventually. I ended up debugging the program using a tip from: How can I tell where my python script is hanging?

python -m trace --trace YOURSCRIPT.py

Using the trace module allowed me to locate where the script stopped functioning. In my case, I ran a loop where each iteration did a bunch of regular expression checks. The script bugged during these checks, but at different points each time. I'm not entirely sure, but Catastrophic Backtracking was a likely cause, as suggested in Mihai Andrei's answer. What is certain, is that I ran very inefficient regex checks. I have rewritten the entire regex part of my script, and now it works fine.

So in conclusion, the answer to my question is:

  1. Debug the script with the trace module to find out where it stops. This lead me to (2)
  2. Make sure to avoid Catastrophic Backtracking (and possible memory leaks in regular expressions[?])

If anyone have more in-depth explanations as to why Python bugs without an error message, or can confirm that memory leaks in regular expressions can cause this, please add to my answer!

Andreas O
  • 239
  • 1
  • 4
  • 15
2

You could run it with a debugger and pause execution when it hangs. See the stack trace. Tools like strace on linux might also help. Your process is likely blocked on some io, or it is stuck on some deadlock.

A user friendly debugger is in pycharm. Export you ipynb to a plain python file.

EDIT new info from question

Don't construct such a huge, unpredictable regex. Just check for all patterns in a loop.

You might have constructed a regex that is prone to catastrophic backtracking

I does not seem so at first glance but this needs a second glance

p = r"^([-+][ \t]*)?(([1-9][0-9]{0,2}( [0-9]{3})*((((,[0-9]*))|(,-)))?)|((([0-9]*(,[0-9]+)?)|([0-9]+((((,[0-9]*))|(,-)))?))))$"
Mihai Andrei
  • 1,024
  • 8
  • 11
  • With your tip I tried using the debugger in VSCode. However, I am not able to pause execution when it stops (it does not hang - it just stops). I have tried using python's trace module, and have updated my question with new information. Is there some other way to use a debugger to get the stack trace of e.g. the last piece of code before it stops? – Andreas O Feb 07 '19 at 16:32
  • i rand your code on the test string you have in the comments without any issues. Does your python process just die? – Mihai Andrei Feb 07 '19 at 17:47