parsing large compressed xml files, python

Question

from bz2 import BZ2File
import xml.parsers.expat

file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file)

Here's code that tries to parse an XML file compressed with bz2. Unfortunately, it fails with this message:

TypeError: Parse() argument 1 must be string or read-only buffer, not bz2.BZ2File

Is there a way to parse bz2-compressed XML files on the fly?

Note: p.Parse(file.read()) is not an option here. The file is larger than available memory, so I need to parse it as a stream.

Nick · Accepted Answer · 2009-12-03T21:56:04.400

5

Just use p.ParseFile(file) instead of p.Parse(file).

Parse() takes a string; ParseFile() takes a file handle and reads the data in as required.

Ref: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
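To make the suggestion concrete, here is a minimal sketch of streaming a bz2 file through ParseFile(). The function name count_elements and the tag-counting handler are illustrative assumptions, not part of the question; the point is only that ParseFile() pulls data through the file object's read() method, so the decompressed document is never held in memory at once.

```python
import bz2
import xml.parsers.expat


def count_elements(path):
    """Stream-parse a bz2-compressed XML file and count its start tags.

    Hypothetical helper for illustration; swap in whatever handlers you need.
    """
    counts = {}

    def on_start(name, attrs):
        # Called by expat once per opening tag; only the running
        # tallies are kept in memory.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    # BZ2File decompresses lazily as ParseFile() calls read(nbytes) on it.
    with bz2.BZ2File(path, "rb") as compressed:
        parser.ParseFile(compressed)
    return counts
```

Memory use then depends on what the handlers store, not on the size of the file.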

edited Dec 03 '09 at 21:56

answered Dec 03 '09 at 21:47

Nick


score 1 · Answer 2 · answered Dec 03 '09 at 21:28

1

Use .read() on the file object to read in the entire file as a string, and then pass that to Parse?

from bz2 import BZ2File
import xml.parsers.expat

file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file.read())

answered Dec 03 '09 at 21:28

Amber


Nice try, but no. I've updated the question to state what was obvious to me but not to you: the file to be parsed will be huge. – Marcin Dec 03 '09 at 21:40
Alright, with the update to the question then yes, Nick's answer is definitely the right one. :) – Amber Dec 04 '09 at 03:54
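For completeness, Parse() does not have to receive the whole document in one call: it takes an optional isfinal argument, so it can be fed one chunk at a time. The sketch below assumes a hypothetical helper name (parse_in_chunks) and an arbitrary 64 KiB chunk size; it is roughly what ParseFile() does for you.

```python
import bz2
import xml.parsers.expat


def parse_in_chunks(path, on_start, chunk_size=64 * 1024):
    """Feed decompressed data to expat one chunk at a time.

    Hypothetical helper; chunk_size is an arbitrary choice.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    with bz2.BZ2File(path, "rb") as compressed:
        while True:
            chunk = compressed.read(chunk_size)
            if not chunk:
                break
            # isfinal=False: more data will follow.
            parser.Parse(chunk, False)
    # An empty final call tells expat the document is complete.
    parser.Parse(b"", True)
```

Only one chunk of decompressed data is held at a time, so this also works for files larger than memory.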

score 0 · Answer 3 · answered Dec 03 '09 at 21:42

Can you pass in an mmap()'ed file? That should take care of automatically paging the needed parts of the file in, and avoid memory overflow. Of course, if expat builds a parse tree, it might still run out of memory.

http://docs.python.org/library/mmap.html

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file.
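A rough sketch of what that could look like, with one important caveat: mapping the .bz2 file itself would expose the compressed bytes, so this only applies to an XML file that is already uncompressed on disk. The helper name parse_mmapped is an assumption for illustration; it relies on the mmap object's file-like read() method, which ParseFile() can call.

```python
import mmap
import xml.parsers.expat


def parse_mmapped(path, on_start):
    """Parse an UNCOMPRESSED XML file through a read-only memory map.

    Hypothetical helper; does not decompress anything.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # ParseFile() only needs a read(nbytes) method, which mmap provides.
            parser.ParseFile(mapped)
```

For the compressed case in the question, passing the BZ2File object straight to ParseFile() remains the simpler route.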
