
I'm loading data from a bunch of XML files with lxml.etree, but I'd like to release them once this initial parsing is done. Currently the XML_FILES list in the code below takes up 350 MiB of the program's 400 MiB of used memory. I've tried `del XML_FILES`, `del XML_FILES[:]`, `XML_FILES = None`, `for etree in XML_FILES: etree = None`, and a few more, but none of these seem to work. I also can't find anything in the lxml docs about closing an lxml file. Here's the code that does the parsing:

from lxml import etree

def open_xml_files():
    # `paths` is a module-level list of XML file paths
    return [etree.parse(filename) for filename in paths]

def load_location_data(xml_files):
    location_data = {}

    for xml_file in xml_files:
        for city in xml_file.findall('City'):
            code = city.findtext('CityCode')
            name = city.findtext('CityName')
            location_data.setdefault('city', {})[code] = name

        # [A few more like the one above]    

    return location_data

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)
# XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?

Underyx
  • How did you determine the memory usage? It might be that `del` does free the structure, but the memory is kept in the process by `malloc`. – Fred Foo Mar 13 '14 at 14:03
  • @larsmans I used [memory_profiler](https://pypi.python.org/pypi/memory_profiler). Here's an example output: http://hastebin.com/wigevoyafu.py – Underyx Mar 13 '14 at 14:10
  • `memory_profiler`, IIRC, suffers from the problem of measuring the process's memory use according to the kernel, rather than according to `malloc`. The process may be holding on to memory for later reuse. Try loading the XML, then `del`, then loading it again and check if the memory usage actually doubles. – Fred Foo Mar 13 '14 at 15:10
  • @larsmans It does not double: http://hastebin.com/tageriwuba.py Now, um, what exactly does that tell me, if you wouldn't mind explaining in a bit more detail? – Underyx Mar 13 '14 at 15:20

4 Answers


The other solutions I found were very inefficient, but this worked for me:

def destroy_tree(tree):
    root = tree.getroot()

    # Record each node's depth and parent so children can be detached
    # before their parents.
    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    # Deepest nodes first, so every node is a leaf by the time it's removed.
    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break  # only the root is left
        parent.remove(child)

    del tree  # drops this function's reference; the caller should drop theirs too
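
A sketch of how this might slot into the question's code (assuming the `XML_FILES` and `utils.open_xml_files()` names from the question); whether the freed memory is actually returned to the OS still depends on malloc, as another answer below explains:

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)

# Hypothetical cleanup once the trees are no longer needed:
for tree in XML_FILES:
    destroy_tree(tree)
del XML_FILES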
PascalVKooten

You might consider etree.iterparse, which uses a generator rather than an in-memory list. Combined with a generator expression, this might save your program some memory.

def open_xml_files():
    return (etree.iterparse(filename) for filename in paths)

iterparse creates a generator over the parsed contents of the file, while parse immediately parses the file and loads the contents into memory. The difference in memory usage comes from the fact that iterparse doesn't actually do anything until its next() method is called (in this case, implicitly via a for loop).

EDIT: Apparently iterparse does work incrementally, but doesn't free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
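
The pattern from that linked answer, roughly adapted to the question's City elements, might look like this (the `tag` filter and the `findtext` calls are assumptions based on the question's code):

from lxml import etree

def iter_cities(filename):
    # Stream 'end' events for City elements instead of building the full tree up front.
    for _, city in etree.iterparse(filename, tag='City'):
        yield city.findtext('CityCode'), city.findtext('CityName')
        city.clear()  # free this element's children and text
        # Delete already-processed siblings still referenced from the root.
        while city.getprevious() is not None:
            del city.getparent()[0]

load_location_data could then consume (code, name) pairs directly, without ever holding a whole tree in memory.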

Emmett Butler
  • I was under the impression that this is mostly just syntax sugar in lxml's case, as I recall reading that it still iterates through the whole file first. Are you sure this works any better in this regard? (It may well be that I'm just remembering this wrong.) – Underyx Mar 13 '14 at 14:13
  • Good point. Apparently `iterparse` does work incrementally, but doesn't free memory as it parses. You may want to check out [this answer](http://stackoverflow.com/a/12161078/735204) – Emmett Butler Mar 13 '14 at 14:17
  • Great, thanks, I'll check that out. Though, I wonder: wouldn't it be simpler to clear memory after we're done with the parsing? I can live with using this much while the parsing is going on, i.e. until the last line in the code I provided. If I could free up all memory used by `XML_FILES` there, that would be sufficient for me. – Underyx Mar 13 '14 at 14:23
  • `iterparse` builds the parse tree incrementally, but it does yield the full tree at the end unless you trim it every time you've processed a part. – Fred Foo Mar 13 '14 at 15:11
  • @EmmettJ.Butler After a great deal of refactoring I ended up using what you linked in your comment, and it works great. Mind editing your answer to include that, so that I can accept it? Thanks! – Underyx Mar 14 '14 at 09:40
  • I've edited my answer to include the solution you used – Emmett Butler Mar 14 '14 at 12:33

Given that the memory usage does not double when the files are parsed a second time after deleting the structure in between (see the comments), here's what's happening:

  • LXML wants memory, so it calls malloc.
  • malloc wants memory, so it requests some from the OS.
  • del deletes the structure as far as Python and LXML are concerned. However, malloc's counterpart free does not actually give the memory back to the OS. Instead, it holds on to it to serve future requests.
  • The next time LXML requests memory, malloc serves it from the same region(s) it previously got from the OS.

This is quite typical behavior for malloc implementations. memory_profiler only checks the process's total memory, including the parts reserved for reuse by malloc. With applications using big, contiguous chunks of memory (e.g. big NumPy arrays), that's fine because those are actually returned to the OS.(*) But for libraries like LXML that request lots of smaller allocations, memory_profiler will give an upper bound, not an exact figure.

(*) At least on Linux with glibc. I'm not sure what macOS and Windows do.
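
A minimal sketch of the doubling experiment from the comments (assuming `paths` is the question's list of XML file paths); run it under memory_profiler and compare the increments reported for the two parsing lines:

from lxml import etree
from memory_profiler import profile

@profile
def parse_twice(paths):
    trees = [etree.parse(filename) for filename in paths]  # RSS grows by the full tree size
    del trees   # freed as far as Python is concerned, but malloc keeps the pages
    trees = [etree.parse(filename) for filename in paths]  # RSS barely grows: pages are reused
    del trees

If the second parse barely moves the total, the memory really was freed back to the allocator, just not to the OS.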

Fred Foo

How about running the memory-consuming code as a separate process and leaving the task of releasing the memory to the operating system? In your case this should do the job:

from multiprocessing import Process, Queue

def get_location_data(q):
    # Parse in a child process; all of its memory is returned to the OS when it exits.
    XML_FILES = utils.open_xml_files()
    q.put(load_location_data(XML_FILES))

q = Queue()
p = Process(target=get_location_data, args=(q,))  # args must be a tuple
p.start()
result = q.get()  # blocks until the child puts the location data
if p.is_alive():
    p.terminate()
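
On Python 3.2+ (or with the futures backport on 2.x), a sketch of the same idea with less plumbing, using concurrent.futures; `utils.open_xml_files` and `load_location_data` are the question's functions:

from concurrent.futures import ProcessPoolExecutor

def get_location_data():
    # Runs in a worker process; the parsed trees die with that process.
    return load_location_data(utils.open_xml_files())

# get_location_data must be defined at module level so it can be pickled.
with ProcessPoolExecutor(max_workers=1) as pool:
    LOCATION_DATA = pool.submit(get_location_data).result()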
Wojciech Walczak