2

I'd like to read a file line by line, except for the last N lines. How do I know where to stop, without reaching the end of the file and backtracking / discarding the last N lines, in Python? Is getting the total number of lines X and then looping X - N times a good way to go about this?

What's the simplest / most Pythonic way of doing this?

Deniz
  • 5
    In general, if lines can be of variable length, there is *no way*, Pythonic or otherwise, of knowing how many lines are in a part of the file you haven't read. – Guy Sirton Nov 02 '14 at 05:37
  • you can read file using `readlines` then apply `len` to get total number of lines in the file, now you can do – Hackaholic Nov 02 '14 at 05:39
  • 1
    @Hackaholic you've just read those lines though... Rather than len you can just slice it [:-N] ... which is "discarding the last N lines"... – Guy Sirton Nov 02 '14 at 05:42
  • yep right, slicing will be better – Hackaholic Nov 02 '14 at 05:44
  • At some level it might seem / I might be asking a dumb question. A line, after all, is '\n', and how can Python know how many of these are left, without actually reading the file on disk... So the bulk of the question is regarding how to do this elegantly. – Deniz Nov 02 '14 at 06:03
  • Is it too big to use `readlines`? – hpaulj Nov 02 '14 at 06:31

4 Answers

2

Unless you have a way to know in advance the actual number of lines, you will have to read the whole file.

But since I assume you want to process the file line by line except for the last N lines, you can do it without loading the whole file into memory, keeping only a list of N lines:

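with open(file) as fd:
    lines = []  # circular buffer holding the N most recently read lines (assumes N >= 1)
    try:
        # prime the buffer with the first N lines
        for i in range(N):
            lines.append(next(fd))

        i = 0
        for line in fd:
            # lines[i] is the oldest buffered line; N newer lines exist, so it is safe to process
            print(lines[i].rstrip())
            lines[i] = line
            i = (i + 1) % N
    except StopIteration:
        # raised by next(fd) above when the file has fewer than N lines
        print("less than %d lines" % (N,))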
Serge Ballesta
2

Three different solutions:

1) Quick and dirty, see John's answer:

with open(file_name) as fid:
    lines = fid.readlines()
for line in lines[:-n_skip]:
    do_something_with(line)

The disadvantage of this method is that you have to read all lines into memory first, which might be a problem for big files.
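One edge case worth keeping in mind with the slice above: when `n_skip` is 0, `lines[:-0]` is the same as `lines[:0]` and gives an empty list instead of all lines. A minimal sketch that computes the stop index explicitly (the helper name `all_but_last` is made up for illustration and is not part of the original answer):

```python
def all_but_last(lines, n_skip):
    """Return every item except the last n_skip; n_skip may be 0."""
    # An explicit stop index avoids the lines[:-0] pitfall.
    stop = max(len(lines) - n_skip, 0)
    return lines[:stop]


lines = ["a\n", "b\n", "c\n"]
print(all_but_last(lines, 0))  # all three lines are kept
print(lines[:-0])              # empty list, because -0 == 0
```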

2) Two passes

Process the file twice, once to count the number of lines n_lines, and in a second pass process only the first n_lines - n_skip lines:

# first pass to count
with open(file_name) as fid:
    n_lines = sum(1 for line in fid)

# second pass to actually do something
with open(file_name) as fid:
    for i_line in range(n_lines - n_skip):  # does nothing if n_lines <= n_skip
        line = fid.readline()
        do_something_with(line)

The disadvantage of this method is that you have to iterate over the file twice, which might be slower in some cases. The good thing, however, is that you never have more than one line in memory.
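The same two-pass idea can be wrapped in a generator. This is only a sketch: the function name `lines_except_last` is hypothetical, and `itertools.islice` is swapped in for the explicit `readline()` loop.

```python
from itertools import islice


def lines_except_last(file_name, n_skip):
    """Yield all lines of file_name except the last n_skip, using two passes."""
    # First pass: count the lines without keeping them.
    with open(file_name) as fid:
        n_lines = sum(1 for _ in fid)

    # Second pass: islice stops after n_lines - n_skip lines
    # (max() keeps the stop value non-negative for short files).
    with open(file_name) as fid:
        yield from islice(fid, max(n_lines - n_skip, 0))
```

Usage would then look like `for line in lines_except_last(file_name, n_skip): do_something_with(line)`.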

3) Use a buffer, similar to Serge's solution

In case you want to iterate over the file just once, you only know for sure that you can process line i if you know that line i + n_skip exists. This means that you have to keep n_skip lines in a temporary buffer first. One way to do this is to implement some sort of FIFO buffer (e.g. with a generator function that implements a circular buffer):

def fifo(it, n):
    buffer = [None] * n  # preallocate buffer (assumes n >= 1)
    i = 0
    full = False
    for item in it:  # leaves last n items in buffer when iterator is exhausted
        if full:
            yield buffer[i]  # yield old item before storing new item
        buffer[i] = item
        i = (i + 1) % n
        if i == 0:  # wrapped around at least once
            full = True

Quick test with a range of numbers:

In [12]: for i in fifo(range(20), 5):
    ...:     print(i, end=' ')
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

The way you would use this with your file:

with open(file_name) as fid:
    for line in fifo(fid, n_skip):
        do_something_with(line)

Note that this requires enough memory to temporarily store n_skip lines, but this is still better than reading all lines into memory as in the first solution.
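As an alternative sketch of the same buffering idea, `collections.deque` can stand in for the hand-rolled circular buffer. The name `skip_last` is hypothetical; unlike `fifo` above, this variant also accepts `n = 0`.

```python
from collections import deque


def skip_last(iterable, n):
    """Yield every item of iterable except the last n (n >= 0)."""
    if n <= 0:
        # Nothing to hold back, so pass everything through.
        yield from iterable
        return
    buffer = deque()
    for item in iterable:
        buffer.append(item)
        if len(buffer) > n:
            # n newer items exist, so the oldest one is not among the last n.
            yield buffer.popleft()


print(list(skip_last(range(20), 5)))  # the numbers 0 through 14
```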

Which of these three methods is best is a trade-off between code complexity, memory, and speed, and depends on your exact application.

Bas Swinckels
1

To read all lines up to the last X lines you need to know where the last X lines begin. You will need this information somewhere. There are several ways to get this information.

  1. When you write the file, save the position where the last X lines begin. Stop reading when you reach that position.
  2. Store the positions of the line beginnings somewhere. This still allows appending to the file.
  3. You know the size of the lines.
    1. Each line could have the same size, so you can compute the number of lines from the file size.
    2. Each line has at least one character, so you do not need to read the last X characters.
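For the fixed-size case (3.1), the line count follows directly from the file size. A minimal sketch, assuming every line is exactly `line_size` bytes including its newline; the function name and the `line_size` parameter are made up for illustration:

```python
import os


def fixed_width_lines_except_last(file_name, line_size, n_skip):
    """Yield all but the last n_skip lines of a file whose lines are all line_size bytes."""
    # With fixed-size lines, the count comes from the file size alone.
    n_lines = os.path.getsize(file_name) // line_size
    with open(file_name, "rb") as fid:  # binary mode so sizes are in bytes
        for _ in range(max(n_lines - n_skip, 0)):
            yield fid.read(line_size)
```

This never reads the last `n_skip` lines at all, but it only works if the fixed-size assumption really holds for the file.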
User
1

Given we know the file must be read to the end to determine how many lines there are, here's my attempt at the "simplest / most Pythonic way" of reading up to the last n lines:

with open(foo, 'r') as f:
    lines = f.readlines()[:-n]
John B
  • Of course, I don't know why I didn't write it that way initially, tired I guess :) – John B Nov 02 '14 at 10:18
  • 1
    In LA, eh? In Italy we use to wish "Good night and dreams of gold!" – gboffi Nov 02 '14 at 11:17
  • This is an easy solution for small files, but for very large files, you don't want to read all lines in memory using `readlines()`, you typically want to process them lazily as you read them. – Bas Swinckels Nov 02 '14 at 11:24
  • @BasSwinckels That's definitely true for large files, but this is just an attempt at the simplest approach. – John B Nov 03 '14 at 11:48