
I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser.

Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take, whether I should use a buffering argument, or whether I should use something from io like io.BufferedReader, io.open, or io.TextIOBase.

A pointer in the right direction would be much appreciated.

Mike S

3 Answers


The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects, that usually means fully buffered.

"Usually" here means that Python leaves this to the C stdlib implementation; it uses an fopen() call (wfopen() on Windows to support UTF-16 filenames), which means the default buffering for a file is chosen; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing, this type of buffering is exactly what you want.

The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).

If you want to control the buffer size, use the buffering keyword argument:

open('foo.xml', buffering=(2<<16) + 8)  # buffer enough for 8 full parser reads

which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help, and using a size of at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
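Putting that together, a minimal sketch (Python 2.7; the 'record' tag name is a placeholder for whatever element you actually stream over) would look something like this, clearing elements as you go so memory stays bounded:

```python
# Sketch: a larger read buffer plus iterparse(), freeing elements as they complete.
from xml.etree import cElementTree as ET

BUFSIZE = (2 << 16) + 8  # 8 full 16384-byte parser reads, plus 8 bytes

with open('foo.xml', 'rb', BUFSIZE) as f:   # buffering passed positionally
    for event, elem in ET.iterparse(f):     # default event is 'end' only
        if elem.tag == 'record':            # placeholder tag name
            # ... process elem here ...
            elem.clear()                    # release the element's children
```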

The io.open() function represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to travel through, and more work done by the Python C code itself instead of being left to the OS.

You could try and see if io.open('foo.xml', 'rb', buffering=2<<16) is going to perform any better. Opening in rb mode will give you an io.BufferedReader instance.

You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data, as it will decode your XML file's encoding itself. A text wrapper would only add extra overhead; you get this type if you open in r (text mode) instead.

Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
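If you do go the io.open() route, a roughly equivalent sketch (again with foo.xml as a stand-in) would be:

```python
# Sketch: the io layer. Opening in 'rb' yields an io.BufferedReader; iterparse()
# reads raw bytes from it and handles the XML encoding declaration itself.
import io
from xml.etree import cElementTree as ET

with io.open('foo.xml', 'rb', buffering=2 << 16) as f:
    for event, elem in ET.iterparse(f):
        # ... process elem, then release it ...
        elem.clear()
```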

Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.

Martijn Pieters
  • Will ElementTree actually work? Won’t it try to put the whole tree into memory? – poke Feb 13 '13 at 21:24
  • @poke: That is what `iterparse()` is *for*. It gives you event-driven parsing with the ElementTree API, so you can free elements again as needed. – Martijn Pieters Feb 13 '13 at 21:34
  • So can you clarify what the difference is between using open() and io.open() in this case? The difference between file and io.TextIOWrapper (since this is what io.open returns)? Also could you explain what you mean by _usually_ fully buffered? Do I have to open it as 'rb' for this since I read that a text file is line buffered? – Mike S Feb 13 '13 at 21:52
  • @MikeS: I'm out of time tonight; that'll have to wait until the morning. Generally, there is no need to add extra layering here, and `open()` on a file will be fully buffered; ttys (your terminal) are usually line-buffered; `b` binary mode does not make a difference there. I'll suss out exactly what 'generally' means tomorrow. – Martijn Pieters Feb 13 '13 at 22:23
  • @MikeS: There, expanded to cover all aspects of buffering and `open()` vs. `io.open()`. I'd use the first. – Martijn Pieters Feb 14 '13 at 07:48
  • Thanks for the detailed answer. I went ahead and marked it as accepted. The only clarification I wanted was you said that the 8kb Linux buffer was _exactly_ what I wanted, and then you went on to say I might want to increase the buffer to at least 4x+8. Can you explain this discrepancy? – Mike S Feb 14 '13 at 17:34
  • @MikeS: No, I meant that you really want OS buffering, when doing sequential reads. The tuning of the buffer size is optional. :-) – Martijn Pieters Feb 14 '13 at 17:35

Have you tried a lazy function? See: Lazy Method for Reading Big File in Python?

That question seems to already answer yours. However, I would consider using this method to write your data to a database. MySQL is free: http://dev.mysql.com/downloads/, and NoSQL databases are also free and might be a little more tailored to operations involving writing 800 GB of data or similar amounts: http://www.oracle.com/technetwork/database/nosqldb/downloads/default-495311.html
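For reference, the lazy approach from that question boils down to a small generator; this is only a sketch, with an arbitrary 1 MB chunk size:

```python
def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    """Lazily yield successive chunks from an open file (sketch)."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk

with open('foo.xml', 'rb') as f:
    for piece in read_in_chunks(f):
        pass  # hand each piece to whatever processes or stores it
```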

RandomUs1r

I haven't tried it with such epic XML files, but the last time I had to deal with large (and relatively simple) XML files, I used a SAX parser.

It basically gives you callbacks for each "event" and leaves it to you to store the data you need. You can give it an open file so you don't have to read it all in at once.
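A minimal handler sketch (the 'record' element name and what you do with the collected text are placeholders):

```python
# Sketch: xml.sax calls back into the handler for each event, so only the data
# you choose to keep stays in memory.
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.buffer = []

    def startElement(self, name, attrs):
        if name == 'record':          # placeholder element name
            self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        if name == 'record':
            text = ''.join(self.buffer)
            # ... do something with text here ...

with open('foo.xml', 'rb') as f:      # you can hand the parser an open file
    xml.sax.parse(f, RecordHandler())
```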

JCash