11

I need to process some data that is a few hundred times bigger than RAM. I would like to read in a large chunk, process it, save the result, free the memory, and repeat. Is there a way to make this efficient in Python?

smci
marshall
  • Possible duplicate: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – David Cain Jul 17 '13 at 21:40
  • Check out pandas and pytables / hdf, or hadoop streaming with Python. If you're on Linux you can use dumbo to facilitate the hadoop/Python interaction. Python has a strong and vibrant community for data analysis; it's hard to miss with a Google search. – agconti Jul 17 '13 at 21:42
  • Not a dup, but also related: [Python file iterator over a binary file with newer idiom](http://stackoverflow.com/questions/4566498/python-file-iterator-over-a-binary-file-with-newer-idiom/4566523#4566523). – abarnert Jul 17 '13 at 22:08
  • Also see [Why doesn't Python's mmap work with large files?](http://stackoverflow.com/questions/1661986/why-doesnt-pythons-mmap-work-with-large-files). It's not directly related, but it has some useful discussion on sliding mmap windows, and how `mmap` is different from `read` under the covers, and so on. – abarnert Jul 17 '13 at 22:10
  • @2rs2ts: Python very definitely can be efficient compared to C. The tiny overhead of the interpreter compared to the cost of `read`ing or `mmap`ping gigabytes is nothing. And the fact that there are nicely-tuned implementations of algorithms that you'd have to write yourself in C means that it's often _more_ efficient. – abarnert Jul 17 '13 at 22:30
  • @abarnert If you take into account the fact that most people have trouble writing C well, then yes, it's often more efficient in Python. I think we both know what each other means. – 2rs2ts Jul 17 '13 at 22:32
  • @2rs2ts: I don't think I actually do know what you mean. (If I _do_, then you're just wrong, and I don't think that's it.) In what way is reading 8KB at a time or mapping 8MB at a time less efficient in Python than in C? – abarnert Jul 17 '13 at 22:36
  • @abarnert I meant that when it comes to Python operations which aren't directly implemented in C (which I assume `read` and `mmap` are, and the "process it" part of OP's question isn't), there is a performance hit *when compared to well-written and optimized C*. But this is par for the course, no? At any rate, it was only a smart aleck comment and of little consequence. – 2rs2ts Jul 17 '13 at 23:03

1 Answer

25

The key is that you want to process the file iteratively.

If you're just dealing with a text file, this is trivial: `for line in f:` only reads one line at a time. (Actually it buffers things up, but the buffers are small enough that you don't have to worry about it.)
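
For example, a minimal sketch (the filename and the `process` function are placeholders, not from the question):

with open('data.txt') as f:   # 'data.txt' stands in for your file
    for line in f:            # one buffered line at a time
        process(line)         # whatever processing you need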

If you're dealing with some other specific file type, like a numpy binary file, a CSV file, an XML document, etc., there are generally similar special-purpose solutions, but nobody can describe them to you unless you tell us what kind of data you have.
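
To pick one of those as an illustration: if it were CSV, the stdlib `csv` module already iterates row by row, so you never hold the whole file. A sketch, with a hypothetical filename:

import csv

with open('data.csv', newline='') as f:  # hypothetical filename
    for row in csv.reader(f):            # parses one row at a time
        process(row)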

But what if you have a general binary file?


First, the `read` method takes an optional argument: the maximum number of bytes to read. So, instead of this:

data = f.read()
process(data)

You can do this:

while True:
    data = f.read(8192)
    if not data:
        break
    process(data)

You may want to instead write a function like this:

def chunks(f):
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data

Then you can just do this:

for chunk in chunks(f):
    process(chunk)

You could also do this with the two-argument `iter`—`iter(callable, sentinel)` keeps calling the callable until it returns the sentinel—but many people find that a bit obscure:

from functools import partial

for chunk in iter(partial(f.read, 8192), b''):
    process(chunk)

Either way, this option applies to all of the other variants below (except for a single mmap, which is trivial enough that there's no point).


There's nothing magic about the number 8192 there. You generally do want a power of 2, and ideally a multiple of your system's page size. Beyond that, your performance won't vary that much whether you're using 4KB or 4MB—and if it does, you'll have to test what works best for your use case.
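
If you do want to test it, a quick timing harness is enough. This is just an illustrative sketch (beware that the OS disk cache will make repeated runs over the same file look faster):

import time

def time_chunk_size(path, chunksize):
    # One full sequential pass over the file at the given chunk size.
    start = time.perf_counter()
    with open(path, 'rb') as f:
        while f.read(chunksize):
            pass
    return time.perf_counter() - start

for chunksize in (4096, 8192, 65536, 1 << 22):               # 4KB up to 4MB
    print(chunksize, time_chunk_size('data.bin', chunksize))  # placeholder filename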


Anyway, this assumes you can just process each 8K at a time without keeping around any context. If you're, e.g., feeding data into a progressive decoder or hasher or something, that's perfect.
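
For example, here's what incremental hashing looks like with the stdlib's `hashlib`, reusing the `chunks` generator from above—the hasher carries all the state, so no chunk ever needs to see another:

import hashlib

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in chunks(f):   # the generator defined earlier
            h.update(chunk)       # running state lives in the hasher
    return h.hexdigest()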

But if you need to process one "chunk" at a time, your chunks could end up straddling an 8K boundary. How do you deal with that?

It depends on how your chunks are delimited in the file, but the basic idea is pretty simple. For example, let's say you use NUL bytes as a separator (not very likely, but easy to show as a toy example).

data = b''
while True:
    buf = f.read(8192)
    if not buf:
        if data:                   # don't process an empty leftover at EOF
            process(data)
        break
    data += buf
    chunks = data.split(b'\0')
    for chunk in chunks[:-1]:      # every complete, NUL-terminated chunk
        process(chunk)
    data = chunks[-1]              # keep the incomplete tail for the next read

This kind of code is very common in networking (because sockets can't just "read all", so you always have to read into a buffer and chunk into messages), so you may find some useful examples in networking code that uses a protocol similar to your file format.


Alternatively, you can use mmap.

If your virtual memory size is larger than the file, this is trivial:

import mmap

with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:  # length 0 maps the whole file
    process(m)

Now `m` acts like a giant `bytes` object, just as if you'd called `read()` to read the whole thing into memory—but the OS will automatically page bits in and out of memory as necessary.
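
For example, you can search and slice it without ever reading the whole file in—`find` and slicing are standard `mmap` methods (the filename is a placeholder):

import mmap

with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        idx = m.find(b'\0')    # -1 if there's no NUL anywhere
        if idx != -1:
            process(m[:idx])   # slicing copies out only these bytes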


If you're trying to read a file too big to fit into your virtual memory size (e.g., a 4GB file with 32-bit Python, or a 20EB file with 64-bit Python—which is only likely to happen in 2013 if you're reading a sparse or virtual file like, say, the VM file for another process on Linux), you have to implement windowing—mmap in a piece of the file at a time. For example:

import mmap, os

windowsize = 8*1024*1024   # a multiple of mmap.ALLOCATIONGRANULARITY, as the offset must be
size = os.fstat(f.fileno()).st_size
for start in range(0, size, windowsize):
    length = min(windowsize, size - start)   # the last window may be short
    with mmap.mmap(f.fileno(), length,
                   access=mmap.ACCESS_READ, offset=start) as m:
        process(m)

Of course mapping windows has the same issue as reading chunks if you need to delimit things, and you can solve it the same way.

But, as an optimization, instead of buffering you can slide the window forward to the page containing the end of the last complete message, rather than advancing a full 8MB each time; that way you can avoid any copying. This is a bit more complicated, so if you want to do it, search for something like "sliding mmap window", and write a new question if you get stuck.
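
To give the flavor, here's a rough, untested sketch of that idea, sticking with the NUL-separated toy format from above. Everything here (the function, the error handling) is illustrative, not canonical; the one hard constraint is that `offset` must stay a multiple of `mmap.ALLOCATIONGRANULARITY`:

import mmap, os

GRAN = mmap.ALLOCATIONGRANULARITY
WINDOW = 8 * 1024 * 1024                      # must be a multiple of GRAN

def process_sliding(f, sep=b'\0'):
    size = os.fstat(f.fileno()).st_size
    pos = 0                                   # first unprocessed byte, absolute
    while pos < size:
        start = (pos // GRAN) * GRAN          # window offset must be GRAN-aligned
        length = min(WINDOW, size - start)
        with mmap.mmap(f.fileno(), length,
                       access=mmap.ACCESS_READ, offset=start) as m:
            skip = pos - start                # already-processed prefix of this window
            end = m.rfind(sep)                # last separator in this window
            if end < skip:                    # no complete message left here
                if start + length == size:    # at EOF: the tail is the last message
                    process(m[skip:])
                    return
                raise ValueError('message longer than window')
            for chunk in m[skip:end].split(sep):
                process(chunk)                # every complete message in this window
            pos = start + end + 1             # just past the last separator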

abarnert
  • I commend you for giving such a well-thought-out answer to such a broad question. Seriously, +1. – 2rs2ts Jul 17 '13 at 22:15
  • Thanks! In my case I would like a chunk to be the size of RAM for efficiency reasons. Can you do that without trial and error? – marshall Jul 18 '13 at 06:25
  • @marshall: You don't really want it to be the size of (physical) RAM, because some of that RAM is needed by the rest of your interpreter space, the kernel, other processes, the disk cache, etc. Also, once you get the chunk large enough, there's not much more gain to be had; if your code is as close to fully pipelined with the disk DMAs as possible, larger reads can't help. You can (and should) test it yourself, but usually the sweet spot will be somewhere between 4KB and 8MB, not anywhere near the limit of physical memory. – abarnert Jul 31 '13 at 16:24
  • @marshall: Meanwhile, if you _do_ want the physical RAM size for some reason, there's no cross-platform way to do it, but you can always read from the `/proc` filesystem for Linux, `ctypes` to `sysctlbyname` for most other *nix systems, `win32api` to `GlobalMemoryStatusEx` for Windows, etc. – abarnert Jul 31 '13 at 17:01
  • Any pointers on how to decide whether 8192 bytes is a good number or not? – Nemo May 06 '18 at 08:15
  • @Nemo Test it with 8192, then test it with 4096 and 16384 and see if either one is faster? There really is no “one size fits all” value, and anything you read will be wrong on a different system, or a few years later. – abarnert May 06 '18 at 15:51