I am getting a "Killed" error when running a Python script. I will post the code below. My first debugging step was to look at /var/log/syslog to see whether any memory issues were reported, but I could not find anything related to the kill event. So I ran my code again with python -m trace --trace. This clearly points to the gzip library, as the last lines are
...
gzip.py(271): self.offset += size
gzip.py(272): return chunk
--- modulename: gzip, funcname: read
gzip.py(242): self._check_closed()
--- modulename: gzip, funcname: _check_closed
gzip.py(154): if self.closed:
--- modulename: gzip, funcname: closed
gzip.py(362): return self.fileobj is None
gzip.py(243): if self.mode != READ:
gzip.py(247): if self.extrasize <= 0 and self.fileobj is None:
gzip.py(250): readsize = 1024
gzip.py(251): if size < 0: # get the whole thing
gzip.py(259): try:
gzip.py(260): while size > self.extrasize:
gzip.py(261): self._read(readsize)
--- modulename: gzip, funcname: _read
gzip.py(279): if self.fileobj is None:
gzip.py(282): if self._new_member:
gzip.py(301): buf = self.fileobj.read(size)
Killed
I don't have much experience with this, so I don't quite know why the read call would fail like this. Another thing I noticed is that the code now fails on files it processed without trouble before, so it's not a problem with the code itself; it looks like a resource issue.
Here is the code:
import gzip
import xml.etree.ElementTree as ET

def fill_pubmed_papers_table(list_of_files):
    for i, f in enumerate(list_of_files):
        print "read file %d named %s" % (i, f)
        inF = gzip.open(f, 'rb')
        tree = ET.parse(inF)
        inF.close()
        root = tree.getroot()
        papers = root.findall('PubmedArticle')
        print "number of papers = ", len(papers)
        # we don't need anything from root anymore
        root.clear()
        for citation in papers:
            write_to_db(citation)
        # If I do not release memory here I get a segfault on the linux server
        del papers
    return
EDIT: thanks to jordanm I found that there is indeed a memory issue... dmesg gives
[127692.401983] Out of memory: Kill process 29035 (python) score 307 or sacrifice child
[127692.401984] Killed process 29035 (python) total-vm:1735776kB, anon-rss:1281708kB, file-rss:0kB
[127693.132028] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=6045672 jiffies, g=3187114, c=3187113, q=0)
[127693.132028] INFO: Stall ended before state dump start
Does anybody know how to use less memory when parsing an XML file? I am a bit puzzled that the gzip read seems to be what triggers the error.
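For reference, this is the direction I am considering: a minimal sketch that swaps ET.parse for ET.iterparse so that only one PubmedArticle is held in memory at a time, instead of the whole document tree. It is untested, it assumes the standard-library ElementTree (cElementTree here) behaves the same on my gzipped input, and it assumes write_to_db can take the element directly as in my code above.

import gzip
import xml.etree.cElementTree as ET

def fill_pubmed_papers_table(list_of_files):
    for i, f in enumerate(list_of_files):
        print "read file %d named %s" % (i, f)
        inF = gzip.open(f, 'rb')
        # iterparse yields elements incrementally instead of building
        # the full tree in memory at once
        context = ET.iterparse(inF, events=('start', 'end'))
        _, root = next(context)  # the first event gives us the root element
        for event, elem in context:
            if event == 'end' and elem.tag == 'PubmedArticle':
                write_to_db(elem)
                # drop the finished article from the root so the
                # partially built tree does not keep growing
                root.clear()
        inF.close()
    return

Would something like this be the right way to keep the memory footprint down, or is there a better approach?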