I am parsing big XML files (~500 MB) with the lxml library in Python. I used BeautifulSoup with the lxml-xml parser for small files, but for huge XMLs it turned out to be inefficient, since it reads the whole file into memory first and then parses it.
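(To make the difference clearer, this is roughly what I mean; the file name is just a placeholder:)

from bs4 import BeautifulSoup
from lxml import etree

# What I did for small files: the whole document is read and a full tree
# is built in memory before anything can be processed.
with open("data.xml", "rb") as f:
    soup = BeautifulSoup(f, "lxml-xml")

# What I am trying now: iterparse hands me elements one by one while the
# file is still being read.
for event, elem in etree.iterparse("data.xml", events=("start", "end")):
    pass  # process each element here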
I need to parse an XML file to get root-to-leaf paths (except the outermost tag).
For example:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
    <B>
        <C>
            abc
        </C>
        <D>
            abd
        </D>
    </B>
</A>
The above XML should produce the following keys and values as output (root-to-leaf paths):
A.B.C = abc
A.B.D = abd
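In terms of the function below, that means the two lists should end up like this (assuming tu.clean_text() strips the surrounding whitespace):

keys = ["A.B.C", "A.B.D"]
values = ["abc", "abd"]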
Here's the code that I've written to parse it (ignore1 and ignore2 are tags that need to be ignored, and tu.clean_text() is a helper function that removes unnecessary characters):
from lxml import etree

def fast_parser(filename, keys, values, ignore1, ignore2):
    context = etree.iterparse(filename, events=('start', 'end',))

    path = list()  # current root-to-leaf path, e.g. ['A', 'B', 'C']
    i = 0
    lastevent = ""
    for event, elem in context:
        i += 1
        # strip the namespace, if any, from the tag name
        tag = elem.tag if "}" not in elem.tag else elem.tag.split('}', 1)[1]

        if tag == ignore1 or tag == ignore2:
            pass
        elif event == "start":
            path.append(tag)
        elif event == "end":
            # a 'start' immediately followed by an 'end' means a leaf element
            if lastevent == "start":
                keys.append(".".join(path))
                values.append(tu.clean_text(elem.text))

            # free memory
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if len(path) > 0:
                path.pop()
        lastevent = event

    del context
    return keys, values
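I call it roughly like this (the file name and ignore tags are placeholders for my actual data):

keys, values = fast_parser("big_file.xml", [], [], "ignoretag1", "ignoretag2")
for k, v in zip(keys, values):
    print(k, "=", v)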
I have already referred to the following article for parsing a large file: ibm.com/developerworks/xml/library/x-hiperfparse/#listing4
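As far as I understood it, the pattern suggested there (listing 4) is roughly the following (my paraphrase, not a verbatim copy):

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        # clear the element and delete already-processed siblings so the
        # partially built tree does not keep growing
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

which is what I tried to replicate inside fast_parser above.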
Here's a screenshot of the top command. Memory usage goes beyond 2 GB for a ~500 MB XML file. I suspect that memory is not getting freed.
I have already gone through a few Stack Overflow questions, but they didn't help. Please advise.