
This is related to In Python, is read() or readlines() faster?, but not exactly the same. I have a small file that I need to read many, many times. I found out that reading it with readlines() and joining the lines is faster than reading it with read(). I could not find a good explanation for this, and it puzzles me.

In [34]: cat test.txt
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  

In [35]: timeit open("test.txt").read()
10000 loops, best of 3: 58.7 µs per loop

In [36]: timeit "\n".join(open("test.txt").readlines())
10000 loops, best of 3: 56.4 µs per loop

The result is pretty consistent.
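
For reference, the same measurement as a standalone script using the timeit module (this assumes the test.txt shown above is in the working directory; note that "\n".join(f.readlines()) is not byte-for-byte equivalent to read(), since readlines() keeps the trailing newlines and joining on "\n" adds blank lines between them):

import timeit

n = 10000
read_time = timeit.timeit(lambda: open("test.txt").read(), number=n)
join_time = timeit.timeit(
    lambda: "\n".join(open("test.txt").readlines()), number=n)

# "".join(f.readlines()) would reproduce read() exactly; "\n".join() is kept
# here only to match the measurement above.
print("read():            %.1f us per loop" % (read_time / n * 1e6))
print("join(readlines()): %.1f us per loop" % (join_time / n * 1e6))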

guma44
    The difference of 2.3 µs is not relevant. –  Jul 11 '18 at 08:41
  • Why don't you read this small file only once and keep it in memory? – Jongware Jul 11 '18 at 08:43
  • It is a status file (here in the example it is not). It has to be read from the disk every time because it could be modified by other processes. – guma44 Jul 11 '18 at 09:09
  • @LutzHorn It might not be relevant if you do it once, but if you do it millions of times that will count. For me it is just counterintuitive. We wanted to change the code to just read(), but we thought: let's measure it :D. – guma44 Jul 11 '18 at 09:12
  • @guma44 You already do it 10,000 times using timeit. Do you plan to read such a file millions of times? –  Jul 11 '18 at 09:13
  • @LutzHorn yes. It is for a web service. I agree this is not the best optimization one can do, but I was just interested in the result and surprised that read() is slower on my setup. – guma44 Jul 11 '18 at 09:57
  • Compared to the costs of network I/O and disk I/O and the sheer overhead of using an interpreted language, this doesn't matter. You'll get much much more mileage over using something like a database or a cache to avoid re-reading the file every time. – David Maze Jul 11 '18 at 10:32
  • I agree. As for the database: unfortunately it is legacy code that needs to be maintained in this form. – guma44 Jul 11 '18 at 11:44
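
Picking up the caching suggestion from the comments above: since the file only changes when another process writes to it, one lightweight middle ground is to keep the contents in memory and re-read only when the file's stat signature changes. A minimal sketch, assuming a single status file and that an mtime/size check is precise enough for the writers involved (the read_status helper and the status.txt path are hypothetical):

import os

_cache = {"sig": None, "text": ""}

def read_status(path="status.txt"):
    # Re-read from disk only when the file's mtime/size signature changes.
    # Note: mtime granularity depends on the filesystem, so an extremely
    # fast writer could in principle slip through unnoticed.
    st = os.stat(path)
    sig = (st.st_mtime, st.st_size)
    if sig != _cache["sig"]:
        with open(path) as fh:
            _cache["text"] = fh.read()
        _cache["sig"] = sig
    return _cache["text"]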

1 Answer


For a file that small, it doesn't make a difference.

For a larger file...

import timeit

data = '''
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  
'''.lstrip()

names_and_sizes = []

# Write nine test files, each roughly double the size of the previous one.
for x in range(1, 10):
    reps = 1 + 2 ** (x + 2)
    with open('test_{}.txt'.format(x), 'w') as outf:
        for _ in range(reps):  # don't shadow the outer loop variable
            outf.write(data)
        names_and_sizes.append((outf.name, outf.tell()))

for filename, size in names_and_sizes:
    a = timeit.timeit(lambda: open(filename).read(), number=1000)
    b = timeit.timeit(lambda: "\n".join(open(filename).readlines()), number=1000)
    print(filename, size, a, b)

The output is:

test_1.txt 7290 0.07285173307172954 0.09389211190864444
test_2.txt 13770 0.08125667599961162 0.1290126950480044
test_3.txt 26730 0.08221574104391038 0.17529957089573145
test_4.txt 52650 0.0865904720267281 0.2977212209952995
test_5.txt 104490 0.1046126070432365 0.5687746809562668
test_6.txt 208170 0.1773586180061102 1.1868972890079021
test_7.txt 415530 0.26339677802752703 2.0290830068988726
test_8.txt 830250 0.31897587003186345 4.381448873900808
test_9.txt 1659690 0.6923789769643918 9.483053435920738

Or, more intuitively, as a chart:

[chart of time spent vs. file size]

(and with both axes logarithmic:)

[log-log chart of time spent]
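
Since the images are not reproduced here, a small sketch of how a comparable log-log chart could be regenerated from the numbers above (this assumes matplotlib is available; the values are rounded from the printed output):

# Hypothetical re-plot of the printed timings on log-log axes.
import matplotlib.pyplot as plt

sizes = [7290, 13770, 26730, 52650, 104490, 208170, 415530, 830250, 1659690]
read_s = [0.073, 0.081, 0.082, 0.087, 0.105, 0.177, 0.263, 0.319, 0.692]
join_s = [0.094, 0.129, 0.175, 0.298, 0.569, 1.187, 2.029, 4.381, 9.483]

plt.loglog(sizes, read_s, marker="o", label="open(f).read()")
plt.loglog(sizes, join_s, marker="o", label='"\\n".join(open(f).readlines())')
plt.xlabel("file size (bytes)")
plt.ylabel("time for 1000 reads (seconds)")
plt.legend()
plt.show()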

AKX
  • Thanks, that's intuitive, but it does not explain why read() is slower at small sizes. – guma44 Jul 11 '18 at 09:10
  • If you look at the numbers in my example, split-and-join is actually always slower. – AKX Jul 11 '18 at 09:50
  • (Only on Python 2 and with the original 810 byte file is split-and-join 3% faster.) – AKX Jul 11 '18 at 09:53
  • And that's my setup. – guma44 Jul 11 '18 at 09:58
  • You can compare the implementations of file.read() and file.readlines() over at Github: https://github.com/python/cpython/blob/2.7/Objects/fileobject.c#L1070-L1149 and https://github.com/python/cpython/blob/2.7/Objects/fileobject.c#L1680-L1817 respectively. I imagine it might have to do with the fact that `.read()` has to dynamically (re)allocate the buffer it is reading into, as it does not know the size of what is to be read beforehand. – AKX Jul 11 '18 at 10:02
  • I would add it to the answer. That sounds reasonable. – guma44 Jul 11 '18 at 10:19
  • Note that the graph is misleading (it has a log-scale x axis but a linear y axis), and the result is kind of intuitive: reading a file without processing it is faster than reading a file, splitting it into lines, constructing a list, and reconstituting a single string out of it. – David Maze Jul 11 '18 at 10:29
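
Following up on the buffer (re)allocation hypothesis in the comments above, one rough experiment is to hand read() the file size up front so its buffer can be sized in one go, and compare the timings. This is only a probe under the assumption that a size hint short-circuits the resizing, not proof; the details differ between Python 2's file object and Python 3's io stack:

# Compare plain read(), read() with an explicit size hint, and
# join(readlines()) on the small file from the question. Binary mode is
# used so the size hint is in bytes on both Python 2 and 3.
import os
import timeit

filename = "test.txt"
size = os.path.getsize(filename)
n = 10000

plain = timeit.timeit(lambda: open(filename, "rb").read(), number=n)
hinted = timeit.timeit(lambda: open(filename, "rb").read(size), number=n)
joined = timeit.timeit(
    lambda: b"".join(open(filename, "rb").readlines()), number=n)

print("read():            %.4f s" % plain)
print("read(size):        %.4f s" % hinted)
print("join(readlines()): %.4f s" % joined)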