Three different solutions:
1) Quick and dirty, see John's answer:
    with open(file_name) as fid:
        lines = fid.readlines()
    for line in lines[:-n_skip]:
        do_something_with(line)
The disadvantage of this method is that you have to read all lines into memory first, which might be a problem for big files.
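One edge case to keep in mind with the slice: when n_skip is 0, lines[:-n_skip] becomes lines[:0], which is empty rather than the whole list. Below is a minimal sketch of a guarded version. The helper name all_but_last is hypothetical and not taken from John's answer.

```python
def all_but_last(lines, n_skip):
    """Return lines without the last n_skip items (hypothetical helper)."""
    # Compute the end index explicitly so that n_skip == 0 keeps everything
    # and n_skip > len(lines) yields an empty list.
    stop = max(len(lines) - n_skip, 0)
    return lines[:stop]


sample = ["a\n", "b\n", "c\n", "d\n"]
print(all_but_last(sample, 1))  # the last line is dropped
print(sample[:-0])              # the pitfall: an empty list
```

This still holds the whole file in memory, so it only addresses the slicing pitfall, not the memory cost.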
2) Two passes
Process the file twice, once to count the number of lines n_lines, and in a second pass process only the first n_lines - n_skip lines:
    # first pass to count
    with open(file_name) as fid:
        n_lines = sum(1 for line in fid)
    # second pass to actually do something
    with open(file_name) as fid:
        # xrange is Python 2; use range in Python 3
        for i_line in xrange(n_lines - n_skip):  # does nothing if n_lines <= n_skip
            line = fid.readline()
            do_something_with(line)
The disadvantage of this method is that you have to iterate over the file twice, which might be slower in some cases. The good thing, however, is that you never have more than one line in memory.
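As a sketch of the same two-pass idea wrapped in a function, the second pass can also be written with itertools.islice instead of an explicit counter loop. The function name process_all_but_last and the handle callback are assumptions made for illustration, and the temporary file only exists to make the example self-contained.

```python
import itertools
import os
import tempfile


def process_all_but_last(file_name, n_skip, handle):
    """Call handle(line) for every line except the last n_skip (two passes)."""
    # First pass: count the lines without storing them.
    with open(file_name) as fid:
        n_lines = sum(1 for _ in fid)
    # Second pass: islice stops after the requested number of lines.
    # max(..., 0) guards against n_skip > n_lines (islice rejects negatives).
    with open(file_name) as fid:
        for line in itertools.islice(fid, max(n_lines - n_skip, 0)):
            handle(line)


# Small demonstration on a temporary five-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\nfour\nfive\n")
seen = []
process_all_but_last(tmp.name, 2, seen.append)
os.remove(tmp.name)
print(seen)
```

As in the loop above, only one line is held in memory at a time; the cost is still the second read of the file.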
3) Use a buffer, similar to Serge's solution
In case you want to iterate over the file just once, you only know for sure that you can process line i if you know that line i + n_skip exists. This means that you have to keep n_skip lines in a temporary buffer first. One way to do this is to implement some sort of FIFO buffer (e.g. with a generator function that implements a circular buffer):
    def fifo(it, n):
        if n <= 0:  # nothing to hold back; an empty buffer cannot be indexed
            for item in it:
                yield item
            return
        buffer = [None] * n  # preallocate buffer
        i = 0
        full = False
        for item in it:  # leaves last n items in buffer when iterator is exhausted
            if full:
                yield buffer[i]  # yield old item before storing new item
            buffer[i] = item
            i = (i + 1) % n
            if i == 0:  # wrapped around at least once
                full = True
Quick test with a range of numbers:
    In [12]: for i in fifo(range(20), 5):
        ...:     print i,
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
The way you would use this with your file:
    with open(file_name) as fid:
        for line in fifo(fid, n_skip):
            do_something_with(line)
Note that this requires enough memory to temporarily store n_skip lines, but this is still better than reading all lines into memory as in the first solution.
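The same buffering idea can also be sketched with collections.deque from the standard library in place of a hand-written circular buffer. This is an alternative illustration, not the generator shown above, and the name skip_last is hypothetical.

```python
from collections import deque


def skip_last(it, n):
    """Yield every item of `it` except the last n (hypothetical name)."""
    if n <= 0:
        # Nothing to hold back: pass everything through.
        for item in it:
            yield item
        return
    window = deque()
    for item in it:
        window.append(item)
        if len(window) > n:
            # The oldest item now has n successors, so it is safe to emit.
            yield window.popleft()


print(list(skip_last(range(20), 5)))
```

The deque never holds more than n + 1 items, so the memory requirement is essentially the same as for the circular buffer.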
Which of these three methods is best depends on your exact application: it is a trade-off between code complexity, memory use, and speed.