I am loading 12 XML files (30-80MB each) in a Python script:
import xml.etree.ElementTree as ET
files = ['1.xml', '2.xml', ..., '11.xml', '12.xml']
trees = [ET.parse(f) for f in files]
This takes around 50 seconds to run. I'll be running it a few times, so I thought I'd try to speed it up with multiprocessing:
import multiprocessing

trees = [None] * len(files)

def _parse_(i):
    # Runs in a worker process; the parsed tree is pickled and sent back to the parent.
    return (i, ET.parse(files[i]))

def _save_(result):
    # The callback runs in the parent process, so it can fill in the shared list.
    i, tree = result
    trees[i] = tree

def concurrent_parse():
    pool = multiprocessing.Pool()
    for i in range(len(files)):
        pool.apply_async(func=_parse_, args=(i,), callback=_save_)
    pool.close()
    pool.join()
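Both timings are rough wall-clock numbers, measured with something like the sketch below (the harness itself is just an assumption for illustration and not the interesting part):

import time

def timed(label, fn):
    # Crude wall-clock timing, only to compare the two versions.
    start = time.time()
    fn()
    print('%s: %.1fs' % (label, time.time() - start))

timed('serial', lambda: [ET.parse(f) for f in files])
timed('concurrent', concurrent_parse)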
The concurrent version now runs in about 30 seconds, which is a nice improvement. However, I run all of this from the shell and then work on the data interactively. After the non-concurrent version completes, Python's memory usage is at a cool 1.73 GB; after the concurrent one it's at 2.57 GB.
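The memory figures are just the resident size of the interpreter process; a quick way to read it from inside the session, assuming psutil is installed (any RSS readout would show the same pattern), is:

import os
import psutil  # assumed installed; used only to read the process's resident size

# Resident set size of the current Python process, in GB.
print(psutil.Process(os.getpid()).memory_info().rss / 1e9)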
I am new to using multiprocessing, so please forgive me if I have missed something basic. But all the other problems I've found with memory not being freed after using Pool point to a failure to call close(), which I am doing.
PS - if this is a really dumb way to load 12 XML files, please feel free to say so.