With a quick test in IPython using %%timeit and sample data only 1,000 lines long, #1 does indeed seem to be the fastest, but the difference is negligible. Using 1,000,000 blank lines, a bigger difference can be seen, with an approach that you did not consider pulling ahead.
To determine the relative performance of different blocks of code, you need to profile the code in question. One of the easiest ways to profile a function or a short snippet is the %timeit "magic" command in IPython (or its cell form, %%timeit, used below).
For this test, I initially used the following sample data:
chars = [chr(c) for c in range(97, 123)]
line = ','.join(c * 5 for c in chars)
# 'aaaaa,bbbbb,ccccc,ddddd,eeeee,fffff,ggggg,hhhhh,iiiii,jjjjj,kkkkk,lllll,mmmmm,nnnnn,ooooo,ppppp,qqqqq,rrrrr,sssss,ttttt,uuuuu,vvvvv,wwwww,xxxxx,yyyyy,zzzzz'
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(line for _ in range(1000)))
The approach that was the fastest:
>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     next(f)  # roughly equivalent to f.readline()
...     data = f.readlines()
...
166 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The other two examples you had were slightly slower:
>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()[1:]
...
177 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()
...     del data[0]
...
168 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Using a file of 1,000,000 blank lines, generated as follows, we can see a bigger difference between the approaches:
with open('test_1.txt', 'w', encoding='utf-8') as f:
    f.write('\n' * 1_000_000)
The initial approaches:
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.readlines()
...
20.4 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     f.readline()
...     data = f.readlines()
...
20.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()[1:]
...
22.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()
...     del data[0]
...
20.7 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slice approach takes the longest because readlines() first builds the full list, and the slice then copies every element except the first into a second list.
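The difference is that slicing always allocates a new list, while del removes the first element from the existing list in place. A quick illustration (the list here is just a stand-in for the readlines() result):

data = ['header', 'row 1', 'row 2']

trimmed = data[1:]  # builds a second list: ['row 1', 'row 2']
del data[0]         # shifts elements within the original list; no new list is created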
Alternate approaches that pulled ahead included reading the file in its entirety in one .read()
call, then splitting it:
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.read().splitlines()
...     del data[0]
...
15.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.read().split('\n', 1)[1].splitlines()
...
15.2 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.read().splitlines()
...
15.2 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The absolute fastest approach I have found so far involved reading the file as binary data, then decoding after reading:
>>> %%timeit
... with open('test_1.txt', 'rb') as f:
...     next(f)
...     data = f.read().decode('utf-8').splitlines()
...
14.2 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In the end, it depends on how much data you need to read and how much memory you have available. For files with fewer lines, the difference between approaches is negligible.
Avoiding the slicing approach is preferable in any case. Reading more data in fewer system calls generally produces faster results, because more of the post-processing can be done in memory rather than through repeated reads on the file handle. If you don't have enough memory, though, reading the whole file at once may not be possible.
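If memory is the constraint, you can stay lazy and process the file line by line instead of reading it all at once; a minimal sketch, where process_line is a hypothetical placeholder for whatever you do with each row:

def process_line(line):
    # hypothetical placeholder for the real per-line work
    ...

with open('test.txt', 'r', encoding='utf-8') as f:
    next(f)  # skip the header line
    for line in f:
        process_line(line.rstrip('\n'))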
Note that for any of these approaches, run times can vary between trials. In my original test with 1,000 lines, the approach that was fastest on the first run came out slower on a later run:
>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.readlines()
...
172 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
It's also important to note that premature optimization is the root of all evil: if this is not a major bottleneck in your program (as revealed by profiling your code), then it's not worth spending a lot of time on it.
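For finding bottlenecks across a whole program (rather than timing an isolated snippet), the standard library's cProfile and pstats modules are one option; a minimal sketch using the earlier test file:

import cProfile
import pstats

def read_file():
    with open('test.txt', 'r', encoding='utf-8') as f:
        next(f)  # skip the header line
        return f.readlines()

# Profile the call and print the 10 most expensive entries by cumulative time
cProfile.run('read_file()', 'read_file.prof')
pstats.Stats('read_file.prof').sort_stats('cumulative').print_stats(10)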
I would recommend reviewing some more resources about how to profile your code: