File stream processing in Python

Question

I've got a data file where each "row" is delimited by `\n\n\n`. My current solution is to slurp the whole file and then split it into rows:

for row in slurped_file.split('\n\n\n'):
    ...

Is there an "awk-like" approach I could take to parse the file as a stream in Python 2.7.9, splitting records on a given string value? Thanks.

Is there a specific reason the `file.read(num_bytes)` method doesn't work for you? Just trying to better understand the requirements. It seems a lazy-generator based on reading bytes into a buffer and yielding split strings would be ideal for this. — aruisdante, Feb 19 '15 at 17:48
There is a [bug/feature request](http://bugs.python.org/issue1152248) for such a thing to be added to the Python standard library; see also [this question](http://stackoverflow.com/questions/19600475/how-to-read-records-terminated-by-custom-separator-from-file-in-python), but there is an easier workaround too. — Antti Haapala -- Слава Україні, Feb 19 '15 at 18:06
The `\n\n\n` sequences delimit large blocks of data (which will fit in memory, but I don't know the size of those blocks in advance). — user2105469, Feb 19 '15 at 18:09
Yes, three consecutive line feeds when parsing with `od -c`. — user2105469, Feb 24 '15 at 09:32

Antti Haapala -- Слава Україні · Accepted Answer · 2015-02-19T18:17:13.620

There is no such thing in the standard library, but we can write a custom generator to iterate over such records:

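def chunk_iterator(iterable):
    # Collect lines until two consecutive empty lines are seen, which is
    # how a '\n\n\n' delimiter after a non-empty line appears when
    # iterating line by line.
    chunk = []
    empty_lines = 0
    for line in iterable:
        chunk.append(line)
        if line == '\n':
            empty_lines += 1
            if empty_lines == 2:
                # Drop the two empty lines; the record keeps the '\n'
                # that ended its last line.
                yield ''.join(chunk[:-2])
                empty_lines, chunk = 0, []
        else:
            empty_lines = 0

    # Whatever follows the last delimiter (possibly an empty string).
    yield ''.join(chunk)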

Use as:

with open('filename') as f:
    for chunk in chunk_iterator(f):
        ...

This relies on the file object's per-line iteration, which is implemented in C in CPython, so it should be faster than a general record-separator solution written in pure Python.
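For comparison, here is a minimal sketch of what a general record-separator generator could look like, along the lines of the buffered `read(num_bytes)` approach suggested in the comments. The names `iter_records`, `separator` and `block_size` are made up for illustration and are not part of any library. It assumes the stream's `read()` returns text and that each record fits in memory.

```python
import io


def iter_records(stream, separator='\n\n\n', block_size=8192):
    # Hypothetical helper (not a standard library function): yields the
    # same pieces as stream.read().split(separator), but reads the
    # stream block by block instead of slurping it.
    buf = ''
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buf += block
        # Every piece except the last is a complete record; the last one
        # may be incomplete (or end in a partial separator), so keep it
        # in the buffer until more data arrives.
        parts = buf.split(separator)
        buf = parts.pop()
        for part in parts:
            yield part
    # Whatever follows the final separator (possibly an empty string).
    yield buf


# Tiny in-memory demonstration; a file object from open() is used the same way.
sample = io.StringIO(u'a\nb\n\n\nc\n\n\n\nd')
records = list(iter_records(sample, block_size=2))
```

Unlike `chunk_iterator`, this does not leave a trailing `'\n'` on each record. The buffer is re-split after every block, so very large records are rescanned repeatedly; that keeps the sketch short at the cost of speed.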
