Saving memory when parsing very large XML files
You could use this approach, which is a bit newer than the effbot.org one and might save you more memory:
Using Python Iterparse For Large XML Files
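As a minimal sketch of the iterparse idea, using only the standard library: stream over the file and clear already-processed elements so the tree never holds the whole document. The file name "records.xml" and the tags "item"/"name" are placeholders for your own data.

```python
import xml.etree.ElementTree as ET

def iter_records(path, tag):
    """Yield each <tag> element without building the full tree in memory."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)          # the first "start" event gives us the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()             # drop references to elements we are done with

# hypothetical usage
for item in iter_records("records.xml", "item"):
    print(item.findtext("name"))
```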
Multiprocessing / Multithreading
If I remember correctly, you cannot easily use multiprocessing to speed up the loading/parsing of the XML. If that were an easy option, everyone would probably already be doing it by default.
Python in general uses a global interpreter lock (GIL), which means a Python process effectively executes bytecode on one CPU core at a time. Threads run within the context of the main Python process and are still bound by that lock, so using threads for CPU-bound work can even decrease performance due to context switching. Running multiple Python processes on multiple cores does bring the expected additional performance, but processes do not share memory, so you need inter-process communication (IPC) to make them work together (you can use a multiprocessing pool, which synchronises when the work is done and is mostly useful for finite tasks that are not too small; a sketch follows this paragraph). Shared memory would be required here, I assume, because every task works on the same big XML.
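As an illustration of the pool approach, here is a hedged sketch in which the main process does the (sequential) parsing and only small, picklable records cross the process boundary to the workers. The file name, tag names and the heavy_work function are made-up placeholders; this only pays off if the per-record work is CPU-heavy compared to the parsing itself.

```python
import multiprocessing as mp
import xml.etree.ElementTree as ET

def heavy_work(record):
    """Placeholder for CPU-bound processing of one extracted record."""
    name, value = record
    return name, len(value)

def extract(path, tag):
    """Parse sequentially in the main process, yield small picklable tuples."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield (elem.findtext("name", ""), elem.findtext("value", ""))
            root.clear()

if __name__ == "__main__":
    # Only the extracted tuples are pickled and sent to the workers (IPC),
    # never the parsed XML tree itself.
    with mp.Pool() as pool:
        for result in pool.imap_unordered(heavy_work, extract("records.xml", "item")):
            print(result)
```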
lxml, however, has ways to work around the GIL, but they only improve performance under certain conditions.
Threading in lxml
For threading in lxml, there is a section in the FAQ that covers this: http://lxml.de/FAQ.html#id1
Can I use threads to concurrently access the lxml API?
Short answer: yes, if you use lxml 2.2 and later.
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.
Does my program run faster if I use threads?
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
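To make the FAQ's point concrete, here is a hedged sketch: each thread parses its own file with its own parser, so the GIL-free parsing inside lxml can overlap. The file names are placeholders, and whether this actually runs faster depends on your workload, as the FAQ says (time and profile it).

```python
import threading
from lxml import etree

def parse_one(path, results, index):
    # One parser per thread; lxml releases the GIL while parsing,
    # so the parsing work on separate files can overlap.
    parser = etree.XMLParser()
    tree = etree.parse(path, parser)
    results[index] = len(tree.getroot())   # e.g. number of top-level elements

if __name__ == "__main__":
    paths = ["part1.xml", "part2.xml", "part3.xml"]   # hypothetical input files
    results = [None] * len(paths)
    threads = [threading.Thread(target=parse_one, args=(p, results, i))
               for i, p in enumerate(paths)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)
```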
Additional tips on optimizing performance for parsing large XML
https://www.ibm.com/developerworks/library/x-hiperfparse/