
I am loading 12 XML files (30-80 MB each) in a Python script:

import xml.etree.ElementTree as ET
files = ['1.xml', '2.xml', ..., '11.xml', '12.xml']
trees = [ET.parse(f) for f in files]

This takes around 50 seconds to run. I'll be running it a few times, so I thought I would try to speed it up with multiprocessing:

import multiprocessing

trees = [None] * len(files)

def _parse_(i):
    # Runs in a worker process; the parsed tree is pickled and sent back to the parent.
    return (i, ET.parse(files[i]))

def _save_(result):
    # The callback runs in the main process, so it can fill in the shared list.
    i, tree = result
    trees[i] = tree

def concurrent_parse():
    pool = multiprocessing.Pool()
    for i in range(len(files)):
        pool.apply_async(func=_parse_, args=(i,), callback=_save_)
    pool.close()
    pool.join()

This now runs in about 30 seconds, which is a nice improvement. However, I am running all of this from the interactive shell and then working on the data interactively. After the first, non-concurrent version completes, Python's memory usage is at a cool 1.73 GB. After the concurrent one, it is at 2.57 GB.
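
For concreteness, here is one minimal way to check those figures from inside the session rather than from top; a sketch that assumes Linux, where /proc/self/status reports the current resident set size (VmRSS) in kB:

def current_rss_gb():
    # Current resident set size of this process, in GB.
    # Linux only: /proc/self/status reports VmRSS in kB.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / (1024.0 * 1024.0)

print('current RSS: %.2f GB' % current_rss_gb())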

I am new to using multiprocessing, so please forgive me if I have missed something basic. But all the other reports I have found of memory not being released after using `Pool` point to a failure to call `close()`, which I am doing.

PS - if this is a really dumb way to load 12 XML files please feel free to say so.

Tim MB
  • I understand that the question is about `multiprocessing`, and it's interesting to me (upvoted and subscribed). But consider using `lxml.etree` if you can. I have 4 generated test files of 20 MB each. Test results for `lxml`/`xml` (without multiprocessing): time 1.47/27.95 sec; memory 411/640 MB. – reclosedev Jan 08 '12 at 14:29
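
For what it's worth, the change reclosedev suggests is close to a drop-in swap, since `lxml.etree` mirrors the `parse()` API of `xml.etree.ElementTree`. A minimal sketch, assuming lxml is installed and with placeholder file names:

from lxml import etree as ET  # same parse()/find() style of API as xml.etree.ElementTree

files = ['1.xml', '2.xml']    # placeholder paths; substitute the real 12 files
trees = [ET.parse(f) for f in files]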

1 Answer

I'm not certain this is actually a leak. The parallel implementation needs more memory to hold all the files simultaneously, and Python may then free the objects without returning the memory to the OS, which would look like Python using more memory than its live objects require.
So what happens if you run concurrent_parse() several times? If the memory usage stays constant, it isn't a leak. If the memory goes up after each run, then there is a problem and you might want to look at this thread for information on tracing leaks: Python memory leaks.
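
A minimal sketch of that test, assuming concurrent_parse() from the question is already defined in the session. It repeats the run and prints the peak resident memory after each pass (stdlib resource module, Unix only); a number that keeps climbing after the first run would suggest a genuine leak:

import resource

def peak_rss_gb():
    # Peak resident set size so far; ru_maxrss is in kB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 * 1024.0)

for run in range(5):
    concurrent_parse()
    print('run %d: peak RSS %.2f GB' % (run + 1, peak_rss_gb()))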

user1013341
  • This is an appealing explanation but I'm not entirely convinced, as the files are held simultaneously by separate Python processes, so the memory used in parsing should be returned to the OS. Rerunning concurrent_parse() grinds my machine to a halt (I gave it about ten minutes) as the memory maxes out and it starts paging everything. If I rerun it with only 2-4 files, the memory does seem to stabilize around 2 GB. However, rerunning with 4-6 files sometimes works fine and other times hits the memory limit. Either way, `multiprocessing`'s maybe not the magic bullet I was hoping for! – Tim MB Jan 09 '12 at 09:48
  • Were you resetting `trees` to all None? This is important, as I have discovered, because the child processes get a copy of the objects from the main process, so if your `trees` list holds a lot of data it gets multiplied by the number of processes. After a bit of experimentation, it looks like there is no increase in memory after running concurrent_parse() repeatedly, so long as `trees` is reset between runs (at least with Python 2.7 on CentOS 5; see the sketch after these comments). I would guess that the increase in memory usage when using multiprocessing is due to IPC serialisation. – user1013341 Jan 10 '12 at 17:14
  • Hmm, I see what you mean. I think you are probably right that it is not a memory leak. But I'm not entirely satisfied as to why the original process ends up using an extra 700 MB of memory when all of the copying of objects happens into separate processes. Either way, I'll lay it to rest, as Python's garbage collection is beyond the scope of the question. Thanks! – Tim MB Jan 12 '12 at 21:30
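
A sketch of the reset described in the comment above, using the question's own globals: on Unix, Pool() forks the main process, so any trees already sitting in `trees` are duplicated into every worker, and clearing the list before the pool is created keeps the children small. The wrapper name concurrent_parse_fresh is hypothetical:

def concurrent_parse_fresh():
    # Drop references to the previously parsed trees *before* Pool() forks,
    # so the workers do not inherit gigabytes of old data.
    global trees
    trees = [None] * len(files)
    concurrent_parse()

# With the reset, repeated runs should keep memory roughly flat rather than
# growing with each run (as reported above for Python 2.7 on CentOS 5).
for run in range(3):
    concurrent_parse_fresh()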