
I have a large input file to read from, so I don't want to use enumerate or fo.readlines(). Plain for line in fo: won't work either, for reasons I'll state below, but I feel some modification of it is what I need. Consider the following file:

 input_file.txt:
 3 # No of tests that will follow
 3 # No of points in current test
 1 # 1st x-coordinate
 2 # 2nd x-coordinate
 3 # 3rd x-coordinate
 2 # 1st y-coordinate
 4 # 2nd y-coordinate
 6 # 3rd y-coordinate
 ...

What I need is to be able to read variable-sized chunks of lines, pair the coordinates up into tuples, add the tuples to a list of cases, and then move on to reading the next case from the file.

I thought of this:

with open(input_file) as f:
    T = int(next(f))
    for _ in range(T):
        N = int(next(f))
        x, y = [], []
        for i in range(N):
            x.append(int(next(f)))
        for i in range(N):
            y.append(int(next(f)))

Then I'd pair the two lists up into tuples. I feel there must be a cleaner way to do this. Any suggestions?

EDIT: The y-coordinates have to be read with a separate for loop. The x and y coordinates are N lines apart. So: read line i; read line (i + N); repeat N times, for each case.

Utumbu

3 Answers


This might not be the shortest possible solution but I believe it is “pretty optimal”.

def parse_number(stream):
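    # split off any inline '# ...' comment, then convert what's left to int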
    return int(next(stream).partition('#')[0].strip())

def parse_coords(stream, count):
    return [parse_number(stream) for i in range(count)]

def parse_test(stream):
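    # a test is its point count followed by that many x's, then that many y's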
    count = parse_number(stream)
    return list(zip(parse_coords(stream, count), parse_coords(stream, count)))

def parse_file(stream):
    for i in range(parse_number(stream)):
        yield parse_test(stream)

It will eagerly parse all the coordinates of a single test, but the tests themselves are produced lazily, one at a time, as you ask for them (parse_file is a generator).

You can use it like this to iterate over the tests:

if __name__ == '__main__':
    with open('input.txt') as istr:
        for test in parse_file(istr):
            print(test)

Better function names might be desired to better distinguish eager from lazy functions. I'm experiencing a lack of naming creativity right now.

5gon12eder
  • +1 for the clean and clear code. `zip(parse_coords(stream, count), parse_coords(stream, count))` Doesn't this tuple have the same element twice? What am I missing here? – Utumbu May 15 '16 at 00:46
  • I admit this is a bit tricky. Since `parse_coords` is evaluated eagerly, the first call consumes the `count` *x* coordinates from the stream and the second call will consume the `count` *y* coordinates. The `zip` will then zip two already fully constructed `list`s. Note that the `stream` is passed by reference and mutated by the functions; see the sketch just below. – 5gon12eder May 15 '16 at 01:33
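A minimal illustration of that last point, with made-up data: two eager list comprehensions draining consecutive slices of one shared iterator, just like the two `parse_coords` calls above.

stream = iter(['1', '2', '3', '2', '4', '6'])  # the x's first, then the y's
xs = [int(next(stream)) for _ in range(3)]     # consumes '1', '2', '3'
ys = [int(next(stream)) for _ in range(3)]     # consumes '2', '4', '6'
print(list(zip(xs, ys)))                       # [(1, 2), (2, 4), (3, 6)]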

How about this, using the grouper recipe from the itertools documentation:

from itertools import islice, zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks
        grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(input_file) as archi:
    T = int(next(archi))
    for _ in range(T):
        N = int(next(archi))
        # take exactly the 2*N coordinate lines that belong to this test
        points = list(grouper(map(int, islice(archi, 2 * N)), N))
        print(points)   # [(1, 2, 3), (2, 4, 6)]
        result = list(zip(*points))
        print(result)   # [(1, 2), (2, 4), (3, 6)]

Here I use islice to take each test's 2*N lines and grouper to split them into one tuple with all the x's and one with all the y's, then use zip to pair those together.
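For reference, grouper on its own behaves like this (a throwaway example):

>>> list(grouper([1, 2, 3, 2, 4, 6], 3))
[(1, 2, 3), (2, 4, 6)]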

Copperfield
  • You should reread the example file in the question, noting which number is the number of cases and which is the number of points in a case. Hint: the OP made the same mistake, so you can say you've been misled... – gboffi May 14 '16 at 23:08
  • @gboffi Right you are. I've corrected that. +1 for an amazing use of grouper. However, the reason I haven't accepted, and the reason I'm not using my `f.next()` version, is that I'm looking for something more concise or, let's say, elegant. – Utumbu May 15 '16 at 00:52

It sounds like you're not really trying to "read a file line by line" so much as skip around the file, treating it like a large list/array but without triggering excessive memory consumption.

Have you looked at the mmap module? With it you can use methods like .find() to locate newlines, optionally starting at some offset (such as just past your current test header), .seek() to move the file pointer to the nth item you've found, and then .readline() and so on.

An mmap object shares some methods and properties with strings and byte arrays, and some with file-like objects, so you can use a mixture of methods like .find() (normal for strings and byte arrays) and .seek() (normal for files).

Additionally, Python's memory mapping uses your operating system's facilities for mapping files into memory. (On Linux and similar systems this is the same mechanism by which shared libraries are mapped into the address space of every running process, for example.) The key point is that your memory is only used as a cache for the contents of the file; the operating system transparently performs the necessary I/O, loading and releasing buffers of the file's contents as needed.

I don't see a method for finding the nth occurrence of some character or string, so there's no way to jump straight to a given line. As far as I can tell you'll have to loop over .find(), but then you can get back to any such line using Python's slice notation. You could write a utility class that scans for, say, 1000 line terminators at a time, storing their offsets in an index/list; then you can use values from that index in slices of the memory mapping.
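A minimal sketch of that indexing idea (the helper names here are my own, not part of the mmap module):

import mmap

def build_line_index(path):
    # Map the whole file read-only and record the offset of every line start.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # mapping stays valid after close
    starts = [0]
    pos = mm.find(b'\n')
    while pos != -1:
        starts.append(pos + 1)           # the next line begins right after the newline
        pos = mm.find(b'\n', pos + 1)
    return mm, starts

def get_line(mm, starts, n):
    # Slice the nth (0-based) line out of the mapping without reading the rest.
    end = starts[n + 1] - 1 if n + 1 < len(starts) else len(mm)
    return mm[starts[n]:end]

With the index built, get_line(mm, starts, i + n) hands you the y-coordinate line matching the x on line i without touching anything in between.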

Jim Dennis