
I have to open several thousand files, but only read the first 3 lines.

Currently, I am doing this:

def test_readline(filename):
    fid = open(filename, 'rb')
    lines = [fid.readline() for i in range(3)]

Which yields the result:

The slowest run took 10.20 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 59.2 µs per loop

An alternate solution would be to convert the fid to a list:

def test_list(filename):
    fid = open(filename, 'rb')
    lines = list(fid) 

%timeit test_list(MYFILE)

The slowest run took 4.92 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 374 µs per loop

Yikes! Is there a faster way to read only the first 3 lines of these files, or is readline() the best? Alternatives with timings would be appreciated.

But at the end of the day I have to open thousands of individual files, and they will not be cached. So does the difference even matter? (It looks like it does.)

(603 µs uncached for the readline method vs. 1840 µs for the list method)

Additionally, here is the readlines() method:

def test_readlines(filename):
    fid = open(filename, 'rb')
    lines = fid.readlines() 
    return lines

The slowest run took 7.17 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 334 µs per loop

name goes here
  • 600µs for 1000 files is still just 0.6 seconds. Not bad for operating on 1000 files I'd say. "Faster" is fine, but at what point is it *too slow*? How often do you have to do this and how fast does it need to be? – deceze Aug 18 '17 at 13:27
  • If you know that the first three lines won’t ever exceed a certain size and you’re okay with overshooting, `.readlines()` also accepts a parameter with a maximum number of bytes or characters to read. Little weird for most situations though (a rough sketch of this idea follows these comments). – Ry- Aug 18 '17 at 13:29
  • 600µs is per file. So it does 'add up'. And I am doing a lot of other things later in the code. Every bit helps and trying to optimize. – name goes here Aug 18 '17 at 13:55
  • @Ryan You should add as an answer instead of a comment and I can time it. But from what I know of readlines() is that it would read the entire file first. – name goes here Aug 18 '17 at 14:04
  • @evanleeturner: It’s conditional on something you haven’t answered. Do the first three lines have a hard size limit? – Ry- Aug 18 '17 at 14:04
  • @Ryan, no they do not have a hard limit – name goes here Aug 18 '17 at 14:06
  • @evanleeturner: Then that approach doesn’t work. – Ry- Aug 18 '17 at 14:08
  • @Ryan It's variable, but I still have to pull the first three because the data in the first 3 'lines' is what I need for this program – name goes here Aug 18 '17 at 14:12
  • It's a little bit off topic, but are you using `multiprocessing`? Although not helping to read faster definitely helps reading more at the same time – Adonis Aug 18 '17 at 14:57
  • If you [*randomly pause it*](https://stackoverflow.com/a/4299378/23771) I'm pretty sure you're going to find that nearly all the time is spent opening and closing files, not much in the read. – Mike Dunlavey Aug 18 '17 at 15:34
  • @Adonis I am not using multiprocessing, but would that really give a speedup, since the bottleneck would be accessing the disk? I.e., the processors would just be waiting on disk requests. – name goes here Aug 18 '17 at 15:43
  • Correct, [this answer](https://stackoverflow.com/a/902455/4121573) pretty much answers your concern. My point here is that you might do other actions on those 3 lines (converting some string to integers or whatever, etc.), and that's where you might get a performance increase. Of course, if those 1000 files are on different disks, multiprocessing would be valuable – Adonis Aug 18 '17 at 15:59
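
A rough sketch of the size-hint idea from Ry-'s comment above, assuming the first three lines always fit within some known bound (the 200-byte hint here is an assumed value, not something from the question):

def test_readlines_hint(filename, hint=200):
    # readlines(hint) stops collecting lines once roughly `hint` bytes have
    # been read, so the whole file is not slurped in; slice off the first 3.
    with open(filename, 'rb') as fid:
        return fid.readlines(hint)[:3]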

1 Answer


You can slice an iterable with itertools.islice:

import itertools


def test_list(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return list(itertools.islice(f, 3))

(I changed the open up a bit because it’s slightly unusual to read files in binary mode by line, but you can revert that.)
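
If the files really do need to be read in binary mode, as in the question's snippets, the same approach works with the open call reverted; a minimal sketch:

import itertools


def test_list_binary(filename):
    # Same islice approach over raw bytes; each element of the result is a
    # bytes object, typically still ending in b'\n'.
    with open(filename, 'rb') as f:
        return list(itertools.islice(f, 3))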

Ry-
  • I had to modify this slightly to work. I get a UTF-8 error when I run on my files because they are binary, so I removed encoding= and added back 'rb'. On timeit this yields: The slowest run took 35.51 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 53.6 µs per loop – name goes here Aug 18 '17 at 14:03
  • I've run timeit multiple times: ~824 µs is the total non-cached time, so still not faster than readline. The slowest run took 15.32 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 53.8 µs per loop – name goes here Aug 18 '17 at 14:08
  • @evanleeturner: Yeah, this answer is just more Pythonic. It’s going to be the same speed as multiple `.readline()` calls, and you’re not going to get much faster than that. – Ry- Aug 18 '17 at 14:15