
I have a memory problem when parsing a large XML file.

The file looks like this (just the first few rows):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <header>
      <log dateTime="2019-02-05T19:00:18" action="created" appInfo="ActualExporter">InternalValues are used</log>
    </header>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M-1" id="366">
      <p name="linkedMrsiteDN">PL/TE-2/p>
      <p name="name">Name of street</p>
      <list name="PiOptions">
        <p>0</p>
        <p>5</p>
        <p>2</p>
        <p>6</p>
        <p>7</p>
        <p>3</p>
        <p>9</p>
        <p>10</p>
      </list>
      <p name="btsName">4251</p>
      <p name="spareInUse">1</p>
    </managedObject>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M10" id="958078">
      <p name="linkedMrsiteDN">PLMN-PLMN/MRSITE-138</p>
      <p name="name">Street 2</p>
      <p name="btsName">748</p>
      <p name="spareInUse">3</p>
    </managedObject>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M21" id="1482118">
      <p name="name">Stree 3</p>
      <p name="btsName">529</p>
      <p name="spareInUse">4</p>
    </managedObject>
  </cmData>
</raml>

I am using the xml.etree.ElementTree parser, but with a file over 4 GB, even on a machine with 32 GB of RAM, I run out of memory. The code I'm using:

def parse_xml(data, string_in, string_out):
    """
    :param data: root element of the parsed XML file
    :param string_in: string that should exist in the distinguished name
    :param string_out: string that should not exist in the distinguished name
    string_in and string_out filter the level of parsing (site or cell)
    :return: dictionary with all necessary objects for the selected technology
    """
    version_dict = {}
    for child in data:
        for grandchild in child:
            dist_name = grandchild.get('distName')
            if isinstance(dist_name, str) and string_in in dist_name and string_out not in dist_name:
                # class and version come from the managedObject attributes
                inner_dict = {'class': grandchild.get('class'),
                              'version': grandchild.get('version')}
                for grandgrandchild in grandchild:
                    if grandgrandchild.tag == '{raml20.xsd}p':
                        inner_dict[grandgrandchild.get('name')] = grandgrandchild.text
                    elif grandgrandchild.tag == '{raml20.xsd}list':
                        p_lista = []
                        for gggchild in grandgrandchild:
                            if gggchild.tag == '{raml20.xsd}p':
                                p_lista.append(gggchild.text)
                            elif gggchild.tag == '{raml20.xsd}item':
                                for gdchild in gggchild:
                                    inner_dict[gdchild.get('name')] = gdchild.text
                        # store the complete list once, after it is filled
                        inner_dict[grandgrandchild.get('name')] = p_lista
                version_dict[dist_name] = inner_dict
    return version_dict

I have tried iterparse with root.clear(), but nothing really helped. I've heard that DOM parsers are the slower ones, but SAX gives me an error:

ValueError: unknown url type: '/development/data/raml20.dtd'

Not sure why; my guess is that SAX tries to resolve the DTD path from the DOCTYPE as a URL. If anyone has any suggestion on how to improve the approach and its performance, I will be really thankful. If bigger XML samples are needed, I am willing to provide them.
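
If that guess is right, disabling external general entities should skip the DTD lookup entirely; a minimal sketch (hypothetical, I have not verified it against the full file):

import xml.sax
from xml.sax.handler import feature_external_ges

parser = xml.sax.make_parser()
# stop the parser from fetching the external DTD named in the DOCTYPE
parser.setFeature(feature_external_ges, False)
# parser.setContentHandler(...)  # a ContentHandler for managedObject goes here
# parser.parse('/development/file.xml')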

Thanks in advance.

EDIT:

Code I tried after the first answer:

import xml.etree.ElementTree as ET


def parse_item(d):
    # wrap the chunk in a root element so it is well-formed XML
    a = '<root>' + d + '</root>'
    tree = ET.fromstring(a)
    outer_dict_yield = {}
    for elem in tree:
        inner_dict_yield = {'version': elem.get('version')}
        for el in elem:
            if isinstance(el.get('name'), str):
                inner_dict_yield[el.get('name')] = el.text
        outer_dict_yield[elem.get('distName')] = inner_dict_yield
    return outer_dict_yield


def read_a_line(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data


min_data = ""
inside = False
outer_main = {}
counter = 0

with open('/development/file.xml') as f:
    for line in read_a_line(f):
        if line.find('<managedObject') != -1:
            inside = True
        if inside:
            min_data += line
        if line.find('</managedObject') != -1:
            inside = False
            counter += 1
            outer_main.update({counter: parse_item(min_data)})
            min_data = ''
jovicbg
  • Try iterparse for big XML with a low memory footprint - set `tag=['managedObject']` – stovfl Apr 11 '19 at 12:26
  • @stovfl It doesn't really help me. Same problem, out of memory. :) But thanks anyway. – jovicbg Apr 12 '19 at 08:47
  • *"Same problem, out of memory"*: Have you tried with **only** screen printing? Read [Iterparse For Large XML Files](https://stackoverflow.com/a/7171543/7414759) – stovfl Apr 12 '19 at 10:57
  • do you have php-cli installed? does this also run out of memory? `php -r 'var_dump(@DOMDocument::loadXML("/development/file.xml"));'` - not important , i'm just curious – hanshenrik May 09 '19 at 19:57
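
For reference, the iterparse pattern the first comment refers to looks roughly like this (a sketch assuming lxml; `elem.clear()` alone is not enough, because iterparse keeps already-parsed siblings attached to the root):

from lxml import etree

MO_TAG = '{raml20.xsd}managedObject'
context = etree.iterparse('/development/file.xml', tag=MO_TAG,
                          load_dtd=False, no_network=True)
for event, elem in context:
    print(elem.get('distName'))            # process one element at a time
    elem.clear()                           # free the element's own payload
    while elem.getprevious() is not None:  # drop already-parsed siblings
        del elem.getparent()[0]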

2 Answers


If you only need to extract the data from the XML file and don't need to perform any XML-specific operations such as XSL transformations, an approach with a very low memory footprint is to define your own TreeBuilder. Example:

import pathlib
from pprint import pprint
from xml.etree import ElementTree as ET


class ManagedObjectsCollector:
    def __init__(self):
        self.items = []          # all collected managedObject dicts
        self.curr_item = None    # dict for the managedObject being parsed
        self.attr_name = None    # name= of the <p> element being read
        self.list_name = None    # name= of the <list> element being read
        self.list_entry = False  # True while inside a <p> nested in a <list>

    def start(self, tag, attr):
        # called for every opening tag
        if tag == '{raml20.xsd}managedObject':
            self.curr_item = dict()
            self.curr_item.update(**attr)
        elif tag == '{raml20.xsd}p':
            if self.list_name is None:
                self.attr_name = attr.get('name', None)
            self.list_entry = self.list_name is not None
        elif tag == '{raml20.xsd}list':
            self.list_name = attr.get('name', None)
            if self.list_name is not None:
                self.curr_item[self.list_name] = []

    def end(self, tag):
        # called for every closing tag
        if tag == '{raml20.xsd}managedObject':
            self.items.append(self.curr_item)
            self.curr_item = None
        elif tag == '{raml20.xsd}p':
            self.attr_name = None
            self.list_entry = False
        elif tag == '{raml20.xsd}list':
            self.list_name = None

    def data(self, data):
        # called for text content between tags
        if self.curr_item is None:
            return
        if self.attr_name is not None:
            self.curr_item[self.attr_name] = data
        elif self.list_entry:
            self.curr_item[self.list_name].append(data)

    def close(self):
        return self.items


if __name__ == '__main__':
    file = pathlib.Path('data.xml')
    with file.open(encoding='utf-8') as stream:
        collector = ManagedObjectsCollector()
        parser = ET.XMLParser(target=collector)
        ET.parse(stream, parser=parser)
    items = collector.items
    print('total:', len(items))
    pprint(items)

Running the above code with your example data will output:

total: 3
[{'PiOptions': ['0', '5', '2', '6', '7', '3', '9', '10'],
  'btsName': '4251',
  'class': 'MRBTS',
  'distName': 'PL/M-1',
  'id': '366',
  'linkedMrsiteDN': 'PL/TE-2',
  'name': 'Name of street',
  'spareInUse': '1',
  'version': 'MRBTS17A_1701_003'},
 {'btsName': '748',
  'class': 'MRBTS',
  'distName': 'PL/M10',
  'id': '958078',
  'linkedMrsiteDN': 'PLMN-PLMN/MRSITE-138',
  'name': 'Street 2',
  'spareInUse': '3',
  'version': 'MRBTS17A_1701_003'},
 {'btsName': '529',
  'class': 'MRBTS',
  'distName': 'PL/M21',
  'id': '1482118',
  'name': 'Street 3',
  'spareInUse': '4',
  'version': 'MRBTS17A_1701_003'}]

Because we don't construct an XML tree in the ManagedObjectsCollector and never keep more than the current chunk of the file in memory at a time, the parser's own memory allocation is minimal; the memory usage is dominated by the collector.items list. The example above collects all the data from each managedObject item, so the list can grow pretty large. You can verify this by commenting out the self.items.append(self.curr_item) line: once the list no longer grows, the memory usage remains constant (roughly 20-30 MiB, depending on your Python version).
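
If the extracted data does not need to stay in memory at all, a further option (a sketch, not part of the original answer; the class name and the on_item callback are made up) is to hand each finished item to a callback instead of appending it to a list:

class ManagedObjectsStreamer(ManagedObjectsCollector):
    """Hypothetical variant: process each managedObject as soon as it is
    complete instead of accumulating all of them in memory."""

    def __init__(self, on_item):
        super().__init__()
        self.on_item = on_item  # user-supplied callback, e.g. a DB insert

    def end(self, tag):
        if tag == '{raml20.xsd}managedObject':
            self.on_item(self.curr_item)  # hand the item off and forget it
            self.curr_item = None
        else:
            super().end(tag)

It plugs in exactly like the collector above, e.g. ET.XMLParser(target=ManagedObjectsStreamer(on_item=print)).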

If you need only parts of the data, a simpler TreeBuilder implementation will do. For example, here's a TreeBuilder that collects only the version attributes and ignores the rest of the tags:

class VersionCollector:
    def __init__(self):
        self.items = []

    def start(self, tag, attr):
        if tag == '{raml20.xsd}managedObject':
            self.items.append(attr['version'])

    def close(self):
        return self.items
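
Usage is the same as with the full collector; a small driver (hypothetical, reusing the data.xml from above) could look like this:

collector = VersionCollector()
parser = ET.XMLParser(target=collector)
with open('data.xml', encoding='utf-8') as stream:
    ET.parse(stream, parser=parser)
print('distinct versions:', sorted(set(collector.items)))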

Bonus

Here is a self-contained script, extended with memory usage measurements. You'll need a few extra packages installed:

$ pip install humanize psutil tqdm

Optional: use lxml for faster parsing:

$ pip install lxml

Run the script with the filename as a parameter. Example output for a 40 MiB XML file:

$ python parse.py data_39M.xml
mem usage:   1%|▏    | 174641152/16483663872 [00:01<03:05, 87764892.80it/s, mem=174.6 MB]
total items memory size: 145.9 MB
total items count: 150603
[{'PiOptions': ['0', '5', '2', '6', '7', '3', '9', '10'],
  'btsName': '4251',
  'class': 'MRBTS',
  'distName': 'PL/M-1',
  'id': '366',
  'linkedMrsiteDN': 'PL/TE-2',
  'name': 'Name of street',
  'spareInUse': '1',
  'version': 'MRBTS17A_1701_003'},
  ...

Notice that for a 40 MiB XML file, the peak memory usage is ~174 MB, while the memory allocated for the items list is ~146 MB; the rest is Python overhead and remains roughly constant regardless of the file size. Extrapolating linearly, the items list for a 4 GB file would need about a hundred times as much, on the order of 15 GB, so at that scale it pays to collect only the fields you need or to process items as soon as they complete. This should give you a rough estimate of how much memory you'll need to read larger files.

Source code:

from collections import deque
import itertools
import pathlib
from pprint import pprint
import os
import sys
import humanize
import psutil
import tqdm

try:
    from lxml import etree as ET
except ImportError:
    from xml.etree import ElementTree as ET


def total_size(o, handlers={}, verbose=False):
    """https://code.activestate.com/recipes/577504/"""
    dict_handler = lambda d: itertools.chain.from_iterable(d.items())
    all_handlers = {
        tuple: iter,
        list: iter,
        deque: iter,
        dict: dict_handler,
        set: iter,
        frozenset: iter,
    }
    all_handlers.update(handlers)
    seen = set()
    default_size = sys.getsizeof(0)

    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        s = sys.getsizeof(o, default_size)

        if verbose:
            print(s, type(o), repr(o), file=sys.stderr)

        for typ, handler in all_handlers.items():
            if isinstance(o, typ):
                s += sum(map(sizeof, handler(o)))
                break
        return s

    return sizeof(o)


class ManagedObjectsCollector:
    def __init__(self, mem_pbar):
        self.item_count = 0
        self.items = []
        self.curr_item = None
        self.attr_name = None
        self.list_name = None
        self.list_entry = False
        self.mem_pbar = mem_pbar
        self.mem_pbar.set_description('mem usage')

    def update_mem_usage(self):
        proc_mem = psutil.Process(os.getpid()).memory_info().rss
        self.mem_pbar.n = 0
        self.mem_pbar.update(proc_mem)
        self.mem_pbar.set_postfix(mem=humanize.naturalsize(proc_mem))

    def start(self, tag, attr):
        if tag == '{raml20.xsd}managedObject':
            self.curr_item = dict()
            self.curr_item.update(**attr)
        elif tag == '{raml20.xsd}p':
            if self.list_name is None:
                self.attr_name = attr.get('name', None)
            self.list_entry = self.list_name is not None
        elif tag == '{raml20.xsd}list':
            self.list_name = attr.get('name', None)
            if self.list_name is not None:
                self.curr_item[self.list_name] = []

    def end(self, tag):
        if tag == '{raml20.xsd}managedObject':
            self.items.append(self.curr_item)
            self.curr_item = None
        elif tag == '{raml20.xsd}p':
            self.attr_name = None
            self.list_entry = False
        elif tag == '{raml20.xsd}list':
            self.list_name = None

        # Updating progress bar costs resources, don't do it
        # on each item parsed or it will slow down the parsing
        self.item_count += 1
        if self.item_count % 10000 == 0:
            self.update_mem_usage()

    def data(self, data):
        if self.curr_item is None:
            return
        if self.attr_name is not None:
            self.curr_item[self.attr_name] = data
        elif self.list_entry:
            self.curr_item[self.list_name].append(data)

    def close(self):
        return self.items


if __name__ == '__main__':
    file = pathlib.Path(sys.argv[1])
    total_mem = psutil.virtual_memory().total
    with file.open(encoding='utf-8') as stream, tqdm.tqdm(total=total_mem, position=0) as pbar_total_mem:
        collector = ManagedObjectsCollector(pbar_total_mem)
        parser = ET.XMLParser(target=collector)
        ET.parse(stream, parser=parser)
    items = collector.items
    print('total:', len(items))
    print('total items memory size:', humanize.naturalsize(total_size(items)))
    pprint(items)
hoefling
  • Hi, first of all, thank you very much for helping me. It looks like the code works, but if I include this: "elif tag == '{raml20.xsd}p': self.curr_attr_name = attr['name']" I am getting the error: KeyError: 'name'. Do you maybe know what the problem is? – jovicbg May 12 '19 at 21:05
  • `name` is the attribute of the `p` element, so e.g. for the element `<p name="spareInUse">1</p>` the expression `attr['name']` will evaluate to `spareInUse`. If you get a `KeyError`, it means there is a `p` element without the `name` attribute somewhere in the XML file. You can circumvent this by replacing the line `self.curr_attr_name = attr['name']` with e.g. `if 'name' in attr.keys(): self.curr_attr_name = attr['name']` or even `self.curr_attr_name = attr.get('name', None)` or similar. – hoefling May 12 '19 at 22:02
  • Yeah, there can be a list. It can contain a tag called "item" which has a list of "p" tags inside. – jovicbg May 12 '19 at 22:23
  • So example data you have posted doesn't reflect the actual data? Can you post a real snippet from the XML, reflecting the schema (or maybe even post the DTD)? The code in my answer works for the schema of the data from your question, but surely not for any arbitrary schema. – hoefling May 12 '19 at 23:48
  • Yes, I'm sorry about that. The thing I need is to create a list (as a value in a dictionary) which will contain all `p` element values between the `<list>` and `</list>` tags, and the key should be the name of the list, "PiOptions" for example. I handled that somehow in my non-efficient code but have a little trouble including it here. P.S. I have edited the example XML in my question. – jovicbg May 13 '19 at 13:25
  • Ok, so basically, you need to extend the `start`, `end` and `data` methods with the case of having a `p` element inside a `list` element; the rest remains the same. I have updated the code in the answer so it consumes the updated XML schema you posted; please try it out. – hoefling May 13 '19 at 16:01

Can I ask a hackish question? Is the file flat? It seems like there are a few parent tags and then all the other tags are managedObject items. Maybe you could write a custom parser that pulls each such tag out, treats it as a small XML document, and then discards it. Streaming through the file lets you alternately read, analyze, and discard items, effectively conserving the memory you are limited by.

Here's some sample code that will stream the file and allow you to process each chunk one by one. Replace parse_item with something that is useful to you.

def parse_item(d):
    print('---')
    print(d)
    print('---')


def read_a_line(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data


min_data = ""
inside = False

with open('bigfile.xml') as f:
    for line in read_a_line(f):
        if line.find('<managedObject') != -1:
            inside = True
        if inside:
            min_data += line  # accumulate the lines of one managedObject
        if line.find('</managedObject') != -1:
            inside = False
            parse_item(min_data)
            min_data = ''

I should also mention I was lazy and used the generator listed here to read the file (but I modified it a bit): Lazy Method for Reading Big File in Python?

jimf
  • Thanks a lot. I have tried the code I posted in the question on a smaller file of 1 GB and it looks like it works. Now I will try with the file of 6.5 GB. – jovicbg May 09 '19 at 11:34