1

So I am attempting to read in a large data file in python. If the data had one column and 1 million rows I would do:

fp = open(ifile, 'r')

for row in fp:  
    process row

My problem arises when the data I am reading in has, say, 1 million columns and only 1 row. What I would like is functionality similar to the fscanf() function in C.

Namely,

while not EOF:  
    part_row = read_next(%lf)  
    work on part_row

I could use fp.read(%lf), if I knew that the format was long float or whatever.

Any thoughts?

user1462620
    You can try just using fp.read(number) where you process that limited data and save the bit that has been split off your remaining data for further processing after a new read. – Octipi Feb 21 '13 at 00:03
  • If you had a `read_until(fp, ',')` function, could you build the rest yourself? If so, I can point you to such functions. If not, I can try to explain how to build the rest. – abarnert Feb 21 '13 at 00:09
  • I think that is the way to go. A read_until(fp,' ') function would allow me to then cast that string into the appropriate value (float, double, int, etc.) – user1462620 Feb 21 '13 at 01:41

3 Answers

3

A million floats in text format really isn't that big... So unless it's proving a bottleneck of some sort, I wouldn't worry about it and would just do:

with open('file') as fin:
    my_data = [process_line(word) for word in fin.read().split()]

A possible alternative (assuming space-delimited "words") is something like:

import mmap, re

with open('whatever.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(r'(.*?)\s', mf):
        print word.group(1)

And that'll scan the entire file and effectively give a massive word stream, regardless of rows / columns.

Jon Clements
  • Well, I just used a million as an example. The actual size of the file is such that a row will NOT fit into main memory. I have figured out how to do what I need on only a subset of the row (and scan over the row via this subset), but I don't know how to make python only read a subset of the row. – user1462620 Feb 21 '13 at 01:21
1

There are two basic ways to approach this:

First, you can write a column-reading function with its own explicit buffer, either as a generator function:

def column_reader(fp):
    buf = ''
    while True:
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            # keep reading until the buffer holds at least one full column
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col

… or as a class:

class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''
    def next(self):
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col
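
For example, either version could be driven like this (just a usage sketch, with `data.csv` as a made-up file name, and bearing in mind the EOF caveat mentioned at the end of this answer):

with open('data.csv') as fp:
    for col in column_reader(fp):
        value = float(col)  # cast each column string to whatever type you expect
        # ... work on value ...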

But, if you write a read_until function that handles the buffering internally, then you can just do this:

next_col = read_until(fp, ',')[:-1]

There are multiple read_until recipes on ActiveState.
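
If you'd rather see the shape of such a function than dig up a recipe, here is a deliberately naive sketch (not the ActiveState code) that reads one character at a time; a real version would buffer its reads for speed:

def read_until(fp, delim):
    # Read characters until we hit delim or EOF; return everything read,
    # including the delimiter, so the caller can strip it with [:-1] as above.
    # (At EOF there is no delimiter to strip, which is part of the EOF
    # handling this answer deliberately leaves out.)
    chars = []
    while True:
        c = fp.read(1)
        if not c:
            break
        chars.append(c)
        if c == delim:
            break
    return ''.join(chars)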

Or, if you mmap the file, you effectively get this for free. You can just treat the file as a huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space—probably not a problem in 64-bit Python builds, but in 32-bit builds, it can be.)
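
As a rough sketch of the mmap approach (assuming comma-delimited columns and a made-up file name, and using find rather than regular expressions):

import mmap

with open('data.csv', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    pos = 0
    while True:
        idx = mf.find(b',', pos)  # next delimiter, or -1 at the last column
        if idx == -1:
            col = mf[pos:]
            # work on the final column, then stop
            break
        col = mf[pos:idx]
        # work on col, e.g. float(col)
        pos = idx + 1
    mf.close()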


Obviously these are incomplete. They don't handle EOF, or newline (in real life you probably have six rows of a million columns, not one, right?), etc. But this should be enough to show the idea.

abarnert
0

You can accomplish this using `yield`.

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('your_file.txt') as f:
    for piece in read_in_chunks(f):
        process_data(piece)

Take a look at this question for more examples.

chirinosky
  • This doesn't do the hard part—splitting the data into columns (and buffering partial columns between `read`s, and so on). In fact, this is really no different than just calling `read(1024)` in a loop; what does it add? – abarnert Feb 21 '13 at 00:10
  • @abarnert do these "columns" take a fixed number of bytes? – Navin Feb 21 '13 at 00:13
  • @Navin: Since I didn't ask the question, I'm only guessing… but I'd guess not. Usually when people talk about columns they mean something like .csv files. – abarnert Feb 21 '13 at 00:17
  • @abarnert What separates your columns? You should be able to split the columns by their delimiter and use regular expressions to do what you're trying. – chirinosky Feb 21 '13 at 00:19
  • @Ramces: As I just said to Navin, it's not my question, so I don't know. But "split the columns by their delimiter" doesn't work if you're just reading 1K chunks, because a column can (and in fact usually will) span two chunks. This means the caller needs to buffer up the chunks anyway. At which point, it's not getting anything out of this that a `while` loop and `f.read(1024)` wouldn't give. (And I'm not sure why you think regular expressions will help at all with that.) – abarnert Feb 21 '13 at 00:21
  • Actually, if you go back and read the question: He wants to read the next float in the same way that C's `fscanf` does it. Which actually means something like "skip whitespace, then read characters until they stop looking like part of a float in %f format". So, not fixed-width columns. – abarnert Feb 21 '13 at 00:25
  • Exactly. If I know the format of the data in advance (i.e. first n columns are floats, then the next k are %e, etc.) then this is trivial in C. In Python, I would like to do exactly what abarnert says above: skip whitespace, then read characters until they stop looking like part of a float in %f format (or a double in %lf, etc.) – user1462620 Feb 21 '13 at 01:24
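
A rough sketch of that last idea (skip whitespace, then collect characters until whitespace again): `read_floats` is a made-up name, and it simply treats any run of non-whitespace characters as one number rather than validating %f syntax the way fscanf would:

def read_floats(fp):
    # Yield floats one at a time, roughly in the spirit of fscanf("%lf"),
    # reading a single character at a time so the whole row never has to
    # fit in memory.
    token = []
    while True:
        c = fp.read(1)
        if not c:  # EOF: flush any final token
            if token:
                yield float(''.join(token))
            return
        if c.isspace():
            if token:  # end of a number
                yield float(''.join(token))
                token = []
        else:
            token.append(c)

Used as `for x in read_floats(fp): ...`, this walks the row one value at a time, which matches the constraint above that a full row won't fit in main memory.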