I have a Python script which does a big parsing job on a Word .docx file (and it has worked on all previous files). It suddenly stops working mid-run without throwing any exceptions. The entire code is wrapped inside a try-except-clause, basically like this:
try:
report.read_and_postprocess() #: My parsing process (with ALOT of code inside)
print("Finished...")
except NoObjectForElementError as error:
print("Known error")
raise error
except:
print("Other error")
raise
finally:
print("Write this!")
To locate the error, I have tried using print()
statements to isolate where the code stops. However, when I get close, the stop point moves to a different place.
This makes it seem to me that there is a memory leak, or perhaps some other limitation to my Python environment which is kicking in - but I'm not very skilled in debugging this. I looked at the memory and CPU usage of my script. In the end of the script it only consumes about 87MB memory, making the computer's total memory move only from 67% to 68%. (I only look at the Windows 10 task manager for this, so I can't guarantee that memory doesn't momentarily spike before the script stops)
I have rerun the script hundreds of times, and never got an error message, except twice with this error:
Python Fatal Error: GC Object already Tracked
I have no clue how to continue debugging this, does someone have any tips? Is it likely caused by memory, or how can I find out? Is there some other Python limitation that could cause this? (I have read about a 20-limit of nested for-loops, for instance - but that should not be the case for me)
Update: It stops during the report.read_and_postprocess()
line.
Info about environment: Windows 10, Anaconda 3, Python 3.7. Script runs in the anaconda prompt, with my environment activated.
Update: I found this tip: https://stackoverflow.com/a/3443779/6700475 It may seem like my code is spending large amounts of time evaluating regular expressions. Basically lines like these were common while watching the trace:
sre_compile.py(596): return isinstance(obj, (str, bytes))
sre_compile.py(763): pattern = p
sre_compile.py(764): p = sre_parse.parse(p, flags)
--- modulename: sre_parse, funcname: parse
sre_parse.py(922): source = Tokenizer(str)
--- modulename: sre_parse, funcname: __init__
sre_parse.py(225): self.istext = isinstance(string, str)
sre_parse.py(226): self.string = string
sre_parse.py(227): if not self.istext:
sre_parse.py(229): self.decoded_string = string
sre_parse.py(230): self.index = 0
sre_parse.py(231): self.next = None
sre_parse.py(232): self.__next()
--- modulename: sre_parse, funcname: __next
sre_parse.py(234): index = self.index
sre_parse.py(235): try:
sre_parse.py(236): char = self.decoded_string[index]
sre_parse.py(240): if char == "\\":
sre_parse.py(247): self.index = index + 1
sre_parse.py(248): self.next = char
sre_parse.py(924): if pattern is None:
sre_parse.py(925): pattern = Pattern()
--- modulename: sre_parse, funcname: __init__
sre_parse.py(77): self.flags = 0
sre_parse.py(78): self.groupdict = {}
I ran the trace and (at least this time) it has stopped during a re.match, specifically this one - on the 3rd iteration of the loop:
def is_numeric(text):
""" Return whether a trimmed string is numeric
Numeric formats:
1
1.2 (US style)
1,2 (EU style)
1,200,340 (US style)
1 200 340 (other style)
1.200.340 (eu style?)
1,200,340.67
1 200 340,67
1 200 340.67
1.200.340,67
-23
-23.98
-2 454 981.21
+ 24
- 24
+ 24.9183
12321
543525,-
123123,
12389.
12 489.
12 432,
"""
if len(text) == 0:
return False
#: Try float
try:
float(text)
except ValueError:
pass
except:
raise
else:
return True
#: look for all characters that should not be in a number
if not re.match(r"^[-+0-9., ]*$", text):
return False
#: Verify numeric format
#: 1.200.200,78
#: 1,200,200.78
#: 1 200 200,78
#: 1 200 200.78
#: 1200200,78
#: 1200200.78
#: - 1200200.78
#: + 1.200.200,78
#: 1200200,-
#: -1 200 200,-
#: etc.
variants = ((r",", r"\."),
(r"\.", r","),
(r" ", r","),
(r" ", r"\."))
for (tho, dec) in variants:
dec_exp_opt = r"(%s[0-9]*)" % dec
if dec == ",":
dec_exp_opt = r"((%s)|(,-))" % dec_exp_opt
threesep = r"[1-9][0-9]{0,2}(%s[0-9]{3})*(%s)?" % (tho, dec_exp_opt)
nullsep = r"(([0-9]*(%s[0-9]+)?)|([0-9]+(%s)?))" % (dec, dec_exp_opt)
expr = r"^([-+][ \t]*)?((%s)|(%s))$" % (threesep, nullsep)
test = re.match(expr, text) #: IT HAS STOPPED HERE IN ITERATION 3!!
if test:
return True
return False
It may (or may not) be random that the expression it is trying to solve at this time is: ^([-+][ \t]*)?(([1-9][0-9]{0,2}( [0-9]{3})*((((,[0-9]*))|(,-)))?)|((([0-9]*(,[0-9]+)?)|([0-9]+((((,[0-9]*))|(,-)))?))))$
for the value 2017-05-29
(which should preferably return false).
Does it make sense that (poor?) regular expressions can cause the script to just stop without raising an exception? Is there some caching in the re
module that could cause this?