There are two basic ways to approach this:
First, you can write a column_reader function with its own explicit buffer, either as a generator function:
```python
def column_reader(fp):
    buf = ''
    while True:
        # Refill the buffer until it holds at least one complete column.
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col
```
… or as a class:
```python
class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''
    def __iter__(self):
        return self
    def __next__(self):  # in Python 2, name this method next instead
        # Refill the buffer until it holds at least one complete column.
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col
```
But if you write a read_until function that handles the buffering internally, then you can just do this:

```python
next_col = read_until(fp, ',')[:-1]
```
There are multiple read_until recipes on ActiveState.
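In case you would rather not pull in a recipe, here is a minimal sketch of the idea. A bare function can't remember leftover bytes between calls, so this version hangs the buffer on a small wrapper class (BufferedReader is a made-up name here, and the chunk size is arbitrary):

```python
class BufferedReader(object):
    """Wraps a file object and buffers reads so read_until can hand
    back exactly one delimiter-terminated chunk per call."""
    def __init__(self, fp, chunk_size=4096):
        self.fp, self.buf, self.chunk_size = fp, '', chunk_size

    def read_until(self, delim):
        # Keep reading until the buffer contains the delimiter.
        while delim not in self.buf:
            chunk = self.fp.read(self.chunk_size)
            if not chunk:
                # EOF: hand back whatever is left in the buffer.
                result, self.buf = self.buf, ''
                return result
            self.buf += chunk
        # Split off everything up to and including the delimiter;
        # the remainder stays buffered for the next call.
        result, _, self.buf = self.buf.partition(delim)
        return result + delim
```

Calling read_until(',') repeatedly then yields one comma-terminated column at a time, with the unterminated remainder returned at EOF.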
Or, if you mmap the file, you effectively get this for free: you can treat the file as one huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space, which is rarely a problem in 64-bit Python builds, but in 32-bit builds it can be.)
Obviously these are incomplete. They don't handle EOF or newlines (in real life you probably have six rows of a million columns each, not just one, right?), and so on. But this should be enough to show the idea.