I am parsing large XML files (~500 MB) with the lxml library in Python. For small files I used BeautifulSoup with the lxml-xml parser, but it is inefficient for huge XML files, since it reads the whole file into memory first and then parses it.

I need to parse an XML file to get root-to-leaf paths (except the outermost tag), e.g.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
    <B>
        <C>
            abc
        </C>
        <D>
            abd
        </D>
    </B>
</A>

Above XML should give keys and values as output (root to leaf paths).

A.B.C = abc
A.B.D = abd
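For reference, this is the raw event stream that lxml's iterparse emits for the document above, which is what the path building below works from (a quick sketch; it assumes the sample is saved as sample.xml, a hypothetical filename):

from lxml import etree

# print the (event, tag) pairs iterparse yields for the sample above;
# 'start' fires at the opening tag, 'end' at the closing tag
for event, elem in etree.iterparse('sample.xml', events=('start', 'end')):
    print(event, elem.tag)

# prints: start A, start B, start C, end C, start D, end D, end B, end A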

Here's the code that I've written to parse it:
(ignore1 and ignore2 are tags that need to be ignored, and tu.clean_text() is an external function that removes unnecessary characters; a stand-in sketch for it follows the code)

from lxml import etree

def fast_parser(filename, keys, values, ignore1, ignore2):
    context = etree.iterparse(filename, events=('start', 'end'))

    path = []
    lastevent = ""
    for event, elem in context:
        # strip the namespace prefix, if any, from the tag name
        tag = elem.tag if "}" not in elem.tag else elem.tag.split('}', 1)[1]

        if tag == ignore1 or tag == ignore2:
            pass
        elif event == "start":
            path.append(tag)
        elif event == "end":
            # a 'start' immediately followed by an 'end' means a leaf element
            if lastevent == "start":
                keys.append(".".join(path))
                values.append(tu.clean_text(elem.text))

            # free memory
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if len(path) > 0:
                path.pop()
        lastevent = event

    del context
    return keys, values
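tu.clean_text() isn't shown above; for anyone who wants to run this, a minimal stand-in might look like the following (an assumption on my part, since the real implementation isn't given):

import re

def clean_text(text):
    # hypothetical stand-in for tu.clean_text(): collapse runs of
    # whitespace and strip leading/trailing padding; handles None,
    # which elem.text can be for empty elements
    return re.sub(r'\s+', ' ', text or '').strip()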

I have already referred to the following article for parsing a large file: ibm.com/developerworks/xml/library/x-hiperfparse/#listing4

Here's a screenshot of the top command: memory usage goes beyond 2 GB for a ~500 MB XML file. I suspect that memory is not getting freed.
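(Peak memory can also be checked from inside the script instead of from top; a minimal sketch using the standard library, assuming Linux:)

import resource

# ru_maxrss is the peak resident set size of the process so far:
# reported in kilobytes on Linux, in bytes on macOS
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: %d kB' % peak)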

I have already gone through a few Stack Overflow questions, but they didn't help. Please advise.

Lakshmikant Deshpande

1 Answer

I took the code from https://stackoverflow.com/a/7171543/131187, chopped out comments and print statements, and added a suitable func to get this output. I wouldn't like to guess how much time it would take to process a 500 MB file!

Even in writing func I have done nothing original, having adopted the original author's use of the xpath expression 'ancestor-or-self::*' to provide the absolute path that you want.

However, since this code conforms more closely to the original script, it might not leak memory.

import lxml.etree as ET

input_xml = 'temp.xml'

# echo the input file
with open(input_xml) as f:
    for line in f:
        print(line.rstrip('\n'))

def mod_fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        # clear the finished element, then delete the now-useless
        # preceding siblings of it and of each of its ancestors, so
        # the partially built tree does not grow with the file
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def func(elem):
    content = '' if not elem.text else elem.text.strip()
    if content:
        # 'ancestor-or-self::*' yields the chain of elements from the
        # root down to this one, i.e. the absolute path
        ancestors = elem.xpath('ancestor-or-self::*')
        print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

print('\nResult:\n')
context = ET.iterparse(open(input_xml, 'rb'), events=('end',))
mod_fast_iter(context, func)

Output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
    <B>
        <C>
            abc
        </C>
        <D>
            abd
        </D>
    </B>
</A>

Result:

A.B.C=abc
A.B.D=abd
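If you want the keys and values collected into lists, as in your original function, rather than printed, func can be adapted along these lines (a sketch under that assumption, reusing mod_fast_iter unchanged):

def collect(elem, keys, values):
    # same logic as func, but append to the caller's lists instead of printing
    content = '' if not elem.text else elem.text.strip()
    if content:
        ancestors = elem.xpath('ancestor-or-self::*')
        keys.append('.'.join(_.tag for _ in ancestors))
        values.append(content)

keys, values = [], []
context = ET.iterparse(open(input_xml, 'rb'), events=('end',))
mod_fast_iter(context, collect, keys, values)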
Bill Bell