reading very large file where format is newline independent

Question

My Python code supports reading and writing data in a file format created by others called the BLT format. The BLT format is white space and newline independent in that a newline is treated just like other white space. The primary entry in this format is a "ballot" which ends with a "0", e.g.,

1 2 3 0

Since the format is newline independent, it could also be written as

1 2
3 0

Or you could have multiple ballots on a line:

1 2 3 0 4 5 6 0

These files can be very large so I don't want to read an entire file into memory. Line-based reading is complicated since the data is not line-based. What is a good way to process these files in a memory-efficient way?

Are you interested in the plain BLT format or the extended (Open STV) one? Is the link that you gave the only specification for each format? — John Machin, Nov 11 '10 at 20:01
Full disclosure, I wrote the linked wiki page and devised the extended format. I'm interested in the original format where it is newline independent (which is not explained in the wiki page because I encourage people to use newlines). — new name, Nov 11 '10 at 20:30

kindall · Accepted Answer · 2010-11-12T17:49:56.560

For me, the most straightforward way to solve this is with generators.

def tokens(filename):
    with open(filename) as infile:
        for line in infile:
            for item in line.split():
                yield int(item)

def ballots(tokens):
    ballot = []
    for t in tokens:
        if t:
            ballot.append(t)
        else:
            yield ballot
            ballot = []

t = tokens("datafile.txt")

for b in ballots(t):
    print b

I see @katrielalex posted a generator-using solution while I was posting mine. The difference between ours is that I'm using two separate generators, one for the individual tokens in the file and one for the specific data structure you wish to parse. The former is passed to the latter as a parameter, the basic idea being that you can write a function like ballots() for each of the data structures you wish to parse. You can either iterate over everything yielded by the generator, or call next() on either generator to get the next token or ballot (be prepared for a StopIteration exception when you run out, or else write the generators to generate a sentinel value such as None when they run out of real data, and check for that).

It would be pretty straightforward to wrap the whole thing in a class. In fact...

class Parser(object):

    def __init__(self, filename):

        def tokens(filename):
            with open(filename) as infile:
                for line in infile:
                    for item in line.split():
                        yield int(item)

        self.tokens = tokens(filename)

    def ballots(self):
        ballot = []
        for t in self.tokens:
            if t:
                ballot.append(t)
            else:
                yield ballot
                ballot = []

p = Parser("datafile.txt")

for b in p.ballots():
    print b

This solution (two generators) is preferable to the one-generator answer, but doesn't go far enough. The file format has other structures besides the "ballot" structure. So you need one generator to parse the file into tokens (**which can include text in quotes**), and a recognizer (which may be a generator) per different structure. — John Machin, Nov 11 '10 at 19:58
Thanks for this answer! The lack of an off-the-shelf method for grabbing the next n items in a file really hurts data-munging in Python compared to the ease of R's `scan` or even Fortran's `read`. This implementation is very clever and efficient---but non-obvious to language newcomers. — Sharpie, Dec 23 '10 at 00:42

score 1 · Answer 2 · edited May 23 '17 at 12:19

1

Use a generator:

>>> def ballots(f):
...     ballots = []
...     for line in f:
...             for token in line.split():
...                     if token == '0':
...                             yield ballots
...                             ballots = []
...                     else:
...                             ballots.append(token)

This will read the file line by line, split on all whitespace, and append the tokens in the line one by one to a list. Whenever a zero is reached, that ballot is yielded and the list reset to empty.

edited May 23 '17 at 12:19

Community

1
1

answered Nov 11 '10 at 19:20

Katriel

120,462
19
136
170

I like the idea, but it seems that it would be slow to iterate over each char... I wonder if I can get memory efficiency and speed. – new name Nov 11 '10 at 19:26
You might try it before complaining about how slow it is. :-) It does read a line at a time, so it's not like you have an I/O call for each item being read. – kindall Nov 11 '10 at 19:30
It **doesn't** iterate over each character; `char` is just a very misleading name for a datum that can contain `"-2"` or `"10"` (according to the given link) or (presumably) `str(any_old_integer)` – John Machin Nov 11 '10 at 19:35
@katriealex: you also need to edit the last 2 sentences: "append the characters one by one" and "the list reset to *zero*" – John Machin Nov 11 '10 at 19:45
@John: I need to learn to read before I write... thanks for pointing that out =) – Katriel Nov 11 '10 at 20:15
@Katrie, I knew what you meant in your original post and it was clear that char wasn't actually a char but a token. – new name Nov 11 '10 at 20:31

reading very large file where format is newline independent

2 Answers2