
This is related to In Python, is read() or readlines() faster?, but not exactly the same. I have a small file that I need to read many, many times. I found out that reading it with readlines() and joining the lines is faster than reading it with read(). I could not find a good explanation for this, and it puzzles me.

In [34]: cat test.txt
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  

In [35]: timeit open("test.txt").read()
10000 loops, best of 3: 58.7 µs per loop

In [36]: timeit "\n".join(open("test.txt").readlines())
10000 loops, best of 3: 56.4 µs per loop

The result is pretty consistent.
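
For reference, the same measurement as a standalone script using the timeit module (this assumes the test.txt shown above is in the working directory; note that "\n".join(f.readlines()) is not byte-for-byte equivalent to read(), since readlines() keeps the trailing newlines and joining on "\n" adds blank lines between them):

import timeit

n = 10000
read_time = timeit.timeit(lambda: open("test.txt").read(), number=n)
join_time = timeit.timeit(
    lambda: "\n".join(open("test.txt").readlines()), number=n)

# "".join(f.readlines()) would reproduce read() exactly; "\n".join() is kept
# here only to match the measurement above.
print("read():            %.1f us per loop" % (read_time / n * 1e6))
print("join(readlines()): %.1f us per loop" % (join_time / n * 1e6))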

guma44
    The difference of 2.3 µs is not relevant. –  Jul 11 '18 at 08:41
  • Why don't you read this small file only once and keep it in memory? – Jongware Jul 11 '18 at 08:43
  • It is a status file (here in the example it is not). It has to be read from the disk every time because it could be modified by other processes. – guma44 Jul 11 '18 at 09:09
  • @LutzHorn It might not be relevant if you do it once, but if you do it millions of times that will count. For me it is just counterintuitive. We wanted to change the code to just read(), but we thought: let's measure it :D. – guma44 Jul 11 '18 at 09:12
  • @guma44 You already do it 10,000 times using timeit. Do you plan to read such a file millions of times? –  Jul 11 '18 at 09:13
  • @LutzHorn yes. It is for a web service. I agree this is not the best optimization one can do, but I was just interested in the result and surprised that read() is slower on my setup. – guma44 Jul 11 '18 at 09:57
  • Compared to the costs of network I/O and disk I/O and the sheer overhead of using an interpreted language, this doesn't matter. You'll get much much more mileage over using something like a database or a cache to avoid re-reading the file every time. – David Maze Jul 11 '18 at 10:32
  • I agree. As for the database: unfortunately it is legacy code that needs to be maintained in this form. – guma44 Jul 11 '18 at 11:44
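
Picking up the caching suggestion from the comments above: since the file only changes when another process writes to it, one lightweight middle ground is to keep the contents in memory and re-read only when the file's stat signature changes. A minimal sketch, assuming a single status file and that an mtime/size check is precise enough for the writers involved (the read_status helper and the status.txt path are hypothetical):

import os

_cache = {"sig": None, "text": ""}

def read_status(path="status.txt"):
    # Re-read from disk only when the file's mtime/size signature changes.
    # Note: mtime granularity depends on the filesystem, so an extremely
    # fast writer could in principle slip through unnoticed.
    st = os.stat(path)
    sig = (st.st_mtime, st.st_size)
    if sig != _cache["sig"]:
        with open(path) as fh:
            _cache["text"] = fh.read()
        _cache["sig"] = sig
    return _cache["text"]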

1 Answer


For a file that small, it doesn't make a difference.

For a larger file...

import timeit

data = '''
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  
'''.lstrip()

names_and_sizes = []

# Write nine test files, each roughly double the size of the previous one.
for x in range(1, 10):
    reps = 1 + 2 ** (x + 2)
    with open('test_{}.txt'.format(x), 'w') as outf:
        for _ in range(reps):  # don't shadow the outer loop variable
            outf.write(data)
        names_and_sizes.append((outf.name, outf.tell()))

for filename, size in names_and_sizes:
    a = timeit.timeit(lambda: open(filename).read(), number=1000)
    b = timeit.timeit(lambda: "\n".join(open(filename).readlines()), number=1000)
    print(filename, size, a, b)

The output is:

test_1.txt 7290 0.07285173307172954 0.09389211190864444
test_2.txt 13770 0.08125667599961162 0.1290126950480044
test_3.txt 26730 0.08221574104391038 0.17529957089573145
test_4.txt 52650 0.0865904720267281 0.2977212209952995
test_5.txt 104490 0.1046126070432365 0.5687746809562668
test_6.txt 208170 0.1773586180061102 1.1868972890079021
test_7.txt 415530 0.26339677802752703 2.0290830068988726
test_8.txt 830250 0.31897587003186345 4.381448873900808
test_9.txt 1659690 0.6923789769643918 9.483053435920738

Or, more intuitively, as a chart:

[chart of time spent vs. file size]

(and with both axes logarithmic:)

[log-log chart of time spent]
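
Since the images are not reproduced here, a small sketch of how a comparable log-log chart could be regenerated from the numbers above (this assumes matplotlib is available; the values are rounded from the printed output):

# Hypothetical re-plot of the printed timings on log-log axes.
import matplotlib.pyplot as plt

sizes = [7290, 13770, 26730, 52650, 104490, 208170, 415530, 830250, 1659690]
read_s = [0.073, 0.081, 0.082, 0.087, 0.105, 0.177, 0.263, 0.319, 0.692]
join_s = [0.094, 0.129, 0.175, 0.298, 0.569, 1.187, 2.029, 4.381, 9.483]

plt.loglog(sizes, read_s, marker="o", label="open(f).read()")
plt.loglog(sizes, join_s, marker="o", label='"\\n".join(open(f).readlines())')
plt.xlabel("file size (bytes)")
plt.ylabel("time for 1000 reads (seconds)")
plt.legend()
plt.show()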

AKX
  • Thanks, that's intuitive, but it does not explain why read() is slower at small sizes. – guma44 Jul 11 '18 at 09:10
  • If you look at the numbers in my example, split-and-join is actually always slower. – AKX Jul 11 '18 at 09:50
  • (Only on Python 2 and with the original 810 byte file is split-and-join 3% faster.) – AKX Jul 11 '18 at 09:53
  • And that's my setup. – guma44 Jul 11 '18 at 09:58
  • You can compare the implementations of file.read() and file.readlines() over at Github: https://github.com/python/cpython/blob/2.7/Objects/fileobject.c#L1070-L1149 and https://github.com/python/cpython/blob/2.7/Objects/fileobject.c#L1680-L1817 respectively. I imagine it might have to do with the fact that `.read()` has to dynamically (re)allocate the buffer it is reading into, as it does not know the size of what is to be read beforehand. – AKX Jul 11 '18 at 10:02
  • I would add it to the answer. That sounds reasonable. – guma44 Jul 11 '18 at 10:19
  • Note that the graph is misleading (it has a log-scale x axis but a linear y axis), and the result is kind of intuitive: reading a file without processing it is faster than reading a file, splitting it into lines, constructing a list, and reconstituting a single string out of it. – David Maze Jul 11 '18 at 10:29
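
Following up on the buffer (re)allocation hypothesis in the comments above, one rough experiment is to hand read() the file size up front so its buffer can be sized in one go, and compare the timings. This is only a probe under the assumption that a size hint short-circuits the resizing, not proof; the details differ between Python 2's file object and Python 3's io stack:

# Compare plain read(), read() with an explicit size hint, and
# join(readlines()) on the small file from the question. Binary mode is
# used so the size hint is in bytes on both Python 2 and 3.
import os
import timeit

filename = "test.txt"
size = os.path.getsize(filename)
n = 10000

plain = timeit.timeit(lambda: open(filename, "rb").read(), number=n)
hinted = timeit.timeit(lambda: open(filename, "rb").read(size), number=n)
joined = timeit.timeit(
    lambda: b"".join(open(filename, "rb").readlines()), number=n)

print("read():            %.4f s" % plain)
print("read(size):        %.4f s" % hinted)
print("join(readlines()): %.4f s" % joined)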