52

I need to extract the last line from a number of very large (several hundred megabyte) text files to get certain data. Currently, I use Python to cycle through all the lines until the file is exhausted and then process the last line returned, but I am certain there is a more efficient way to do this.

What is the best way to retrieve just the last line of a text file using Python?

James Waldby - jwpat7
TimothyAWiseman

11 Answers

55

Not the straightforward way, but probably much faster than a simple Python implementation:

import subprocess

line = subprocess.check_output(['tail', '-1', filename])
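Note that on Python 3, check_output returns bytes, so you usually want to decode the result and strip the trailing newline. A minimal usage sketch (the sample file name is just for illustration):

```python
import subprocess

# Write a small sample file just for demonstration (hypothetical name).
with open('data.txt', 'w') as f:
    f.write("first\nsecond\nlast\n")

# check_output returns bytes on Python 3; decode and strip the newline.
line = subprocess.check_output(['tail', '-1', 'data.txt']).decode().rstrip('\n')
```

This spawns a process per call, so it pays off mainly when the files are large enough that the process-startup cost is negligible.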
sth
46
with open('output.txt', 'r') as f:
    lines = f.read().splitlines()
    last_line = lines[-1]
    print(last_line)
bfontaine
mick barry
13

Use the file's seek method with a negative offset and whence=os.SEEK_END to read a block from the end of the file. Search that block for the last line end character(s) and grab all the characters after it. If there is no line end, back up farther and repeat the process.

import os

def last_line(in_file, block_size=1024, ignore_ending_newline=False):
    suffix = ""
    in_file.seek(0, os.SEEK_END)
    in_file_length = in_file.tell()
    seek_offset = 0

    while -seek_offset < in_file_length:
        # Read from end.
        seek_offset -= block_size
        if -seek_offset > in_file_length:
            # Limit if we ran out of file (can't seek backward from start).
            block_size -= -seek_offset - in_file_length
            if block_size == 0:
                break
            seek_offset = -in_file_length
        in_file.seek(seek_offset, os.SEEK_END)
        buf = in_file.read(block_size)

        # Search for line end.
        if ignore_ending_newline and seek_offset == -block_size and buf[-1] == '\n':
            buf = buf[:-1]
        pos = buf.rfind('\n')
        if pos != -1:
            # Found line end.
            return buf[pos+1:] + suffix

        suffix = buf + suffix

    # One-line file.
    return suffix

Note that this will not work on things that don't support seek, like stdin or sockets. In those cases, you're stuck reading the whole thing (like the tail command does).

Mike DeSimone
    I think this answer only works properly in Python 2. At least, it didn't work for me in Python 3, because you can't seek relative from the end of a text file in Python 3 (throws an io exception). To update this to Python 3: use a binary file, then you have to use byte arrays instead of strings for `buf` (make sure you compare `buf[-1:] == b'\n'`). You can use `suffix.decode('utf-8')` to return a string, if you're sure it's utf-8 encoded. – Multihunter Jan 07 '20 at 02:04
8

If you do know the maximal length of a line, you can do

def getLastLine(fname, maxLineLength=80):
    fp = file(fname, "rb")
    fp.seek(-maxLineLength - 1, 2)  # 2 means "from the end of the file"
    return fp.readlines()[-1]

This works on my Windows machine, but I do not know what happens on other platforms if you open a text file in binary mode. Binary mode is needed if you want to seek relative to the end of the file.

rocksportrocker
    And if you don't know the maximum line length? – Adam Rosenfield Aug 23 '11 at 20:28
    both this and mike's answer are "the right way to do it", but have issues for anything other than simple (single byte, eg ASCII) text encodings. unicode can have multi-byte characters, so in that case (1) you don't know the relative offset in bytes for a given maximum length in characters and (2) you may seek into "the middle" of a character. – andrew cooke Aug 23 '11 at 20:31
  • @Adam, you can usually pick a number that is greater than any reasonable line length even if it isn't a guaranteed maximum. If you absolutely can't make any assumptions or accept a truncated line, you have no choice but to read the whole file. – Mark Ransom Aug 23 '11 at 20:34
    @andrew, the end-of-line byte code in UTF-8 will still be unique even if you start in the middle of a character. That's one of the beauties of UTF-8. – Mark Ransom Aug 23 '11 at 20:35
  • @andrew, Windows' multi-byte line end causes more issues than UTF-8. I'll amend my answer to support it if someone needs it. – Mike DeSimone Aug 23 '11 at 21:14
  • @andrew cooke: Only UTF-16 and UTF-32 can have a CR or LF byte inside a character. They are fixed-length encodings. No designer of a variable-byte-count encoding has been silly enough to have CR or LF byte-values inside a character. – John Machin Aug 23 '11 at 21:27
  • that wasn't exactly my point; what i didn't know was how well utf-8 could synchronize mid-stream. but apparently it's not an issue. which is great. – andrew cooke Aug 24 '11 at 01:10
    @andrew: UTF-8 can sync midstream because the bytes in the representation of a code point >= U+80 all have the high bit set. Therefore, if the high bit is clear, it's a low-ASCII character. This makes us parser writers happy. On the other hand, there are formats such as Shift-JIS, which encode non-low-ASCII characters as two bytes, but only the first byte is guaranteed to have a high bit set. Luckily, they didn't use control characters for the second byte. – Mike DeSimone Aug 24 '11 at 05:02
    @Mike: It's not just that characters below U+80 have the high bit clear -- UTF-8 can resync midstream because there is no overlap between the set of possible *initial* bytes and the set of possible non-initial bytes in all characters. The possible initial bytes all begin with 0 or 11, whereas the possible non-initial bytes begin with 10. – Adam Rosenfield Aug 24 '11 at 21:16
  • @Adam: When I said "the bytes in the representation of a code point >= U+80" I was referring to both the initial bytes and the non-initial bytes. (Plural "bytes" with singular "code point".) Sorry for the confusion. – Mike DeSimone Aug 25 '11 at 03:28
    file() is not supported in Python 3; use open() instead. – Ludo Schmidt Aug 07 '20 at 09:50
7

If you can pick a reasonable maximum line length, you can seek to nearly the end of the file before you start reading.

import os

myfile.seek(-max_line_length, os.SEEK_END)
line = myfile.readlines()[-1]
Mark Ransom
5

Seek to the end of the file minus 100 bytes or so. Do a read and search for a newline. If there is no newline, seek back another 100 bytes or so. Lather, rinse, repeat. Eventually you'll find a newline. The last line begins immediately after that newline.

Best case scenario you only do one read of 100 bytes.
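A minimal sketch of that loop, assuming the file is opened in binary mode (the 100-byte probe size follows the description above; the function name is made up):

```python
import os

def tail_last_line(f, probe=100):
    """Back up `probe` more bytes from the end each pass until a newline appears."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    window = 0
    chunk = b""
    while window < size:
        window = min(window + probe, size)
        f.seek(size - window)
        chunk = f.read(window)
        data = chunk.rstrip(b'\n')   # ignore the file's trailing newline(s)
        nl = data.rfind(b'\n')
        if nl != -1:
            return data[nl + 1:]
        # No newline yet: lather, rinse, repeat with a bigger window.
    return chunk.rstrip(b'\n')       # one-line file
```

In the best case, a single 100-byte read is enough, regardless of file size.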

Bryan Oakley
2

The inefficiency here is not really due to Python, but to the nature of how files are read. The only way to find the last line is to read the file and find the line endings. However, the seek operation can jump to any byte offset in the file. You can therefore begin very close to the end of the file, and grab larger and larger chunks as needed until the last line ending is found:

from os import SEEK_SET, SEEK_END

def get_last_line(file):
  CHUNK_SIZE = 1024 # Would be good to make this the chunk size of the filesystem

  # Find the file size first, so we never seek before the start of the file
  # (seeking to a negative position raises an error).
  file.seek(0, SEEK_END)
  remaining = file.tell()

  last_line = ""

  while True:
    # We grab chunks from the end of the file towards the beginning until we
    # get a new line
    chunk_size = min(CHUNK_SIZE, remaining)
    remaining -= chunk_size
    file.seek(remaining, SEEK_SET)
    chunk = file.read(chunk_size)

    if not chunk:
      # The whole file is one big line
      return last_line

    if not last_line and chunk.endswith('\n'):
      # Ignore the trailing newline at the end of the file (but include it 
      # in the output).
      last_line = '\n'
      chunk = chunk[:-1]

    nl_pos = chunk.rfind('\n')
    # What's being searched for will have to be modified if you are searching
    # files with non-unix line endings.

    last_line = chunk[nl_pos + 1:] + last_line

    if nl_pos == -1:
      # The whole chunk is part of the last line.
      continue

    return last_line
Zack Bloom
1

Here's a slightly different solution. Instead of multi-line, I focused on just the last line, and instead of a constant block size, I have a dynamic (doubling) block size. See comments for more info.

# Get the last line of a text file using the seek method.  Works with a
# non-constant block size.  I don't know if that speeds things up, but it's
# good enough for us, especially with constant line lengths in the file
# (provided by len_guess), in which case the block size doubling is rarely
# performed, if at all.  Currently, we're using this on a text file format
# with constant line lengths.
# Requires that the file is opened in binary mode.  No nonzero end-relative
# seeks are allowed in text mode.
import os

def lastTextFileLine(file, len_guess=1):
    file.seek(-1, os.SEEK_END)       # look at the very last character
    text = file.read(1)
    tot_sz = 1               # total size read so far, counted from the file end
    if text != b'\n':        # if newline is the last character, we want the text right before it
        file.seek(0, os.SEEK_END)    # else, consider the text all the way at the end (after the last newline)
        tot_sz = 0
    blocks = []              # successive search blocks, so we never re-search bytes already seen
    remaining = file.tell() - tot_sz # bytes left in the file ahead of the search window
    block_sz = len_guess
    while remaining > 0:
        if remaining < block_sz:     # don't let block doubling take us past the start of the file
            block_sz = remaining
        tot_sz += block_sz
        remaining -= block_sz
        file.seek(-tot_sz, os.SEEK_END)  # seek() accepts negative offsets for seeking backward from the file end
        text = file.read(block_sz)
        i = text.rfind(b'\n')
        if i != -1:
            blocks.append(text[i + 1:])
            return b''.join(reversed(blocks)).decode()
        blocks.append(text)
        block_sz <<= 1       # double the block size (open-ended binary-search-like strategy)
    return b''.join(reversed(blocks)).decode()   # if no newline was found, return everything read

Ideally, you'd wrap this in a class LastTextFileLine and keep track of a moving average of line lengths; that would give you a good len_guess.

0

Could you load the file into a mmap, then use mmap.rfind(string[, start[, end]]) to find the second-to-last EOL character in the file? A seek to that point in the file should point you to the last line, I would think.
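A sketch of that idea, assuming the file is non-empty (mmap cannot map an empty file); the function name is made up:

```python
import mmap

def last_line_mmap(path):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            if mm[end - 1:end] == b'\n':          # ignore a trailing newline
                end -= 1
            start = mm.rfind(b'\n', 0, end) + 1   # rfind returns -1 if none, so start becomes 0
            return mm[start:end].decode()
```

The OS pages in only the portions of the file actually touched, so this avoids reading the whole file even though it "maps" all of it.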

ChrisC
-3
lines = fileHandle.readlines()
fileHandle.close()
last_line = lines[-1]
Jon Martin
    Gah! Don't ever do `lines[len(lines) -1]`. That's an `O(n)` operation. `lines[-1]` will get the last one. Besides, this isn't any better than the approach he's already using. – g.d.d.c Aug 23 '11 at 20:20
  • Oops, my mistake! This method actually is more efficient though. – Jon Martin Aug 23 '11 at 20:21
  • 11
    @g.d.d.c: `lines[len(lines)-1]` is not O(n) (unless `lines` is a user-defined type with an O(n) implementation of `__len__`, but that's not the case here). While it's bad style, `lines[len(lines)-1]` has a practically identical runtime cost as `lines[-1]`; the only difference is whether the index calculation is done explicitly in script or implicitly by the runtime. – Adam Rosenfield Aug 23 '11 at 20:23
  • This, however, sounds very memory inefficient, as you have to read a possibly large file into memory before performing said `O(1)` operation. – gustafbstrom Oct 19 '18 at 10:37
-6
#!/usr/bin/python

# First pass: count the lines in the file.
count = 0
f = open('last_line1', 'r')
for line in f.readlines():
    line = line.strip()
    count = count + 1
    print(line)
print(count)
f.close()

# Second pass: stop at the line whose number matches the count.
count1 = 0
h = open('last_line1', 'r')
for line in h.readlines():
    line = line.strip()
    count1 = count1 + 1
    if count1 == count:
        print(line)         # -------------------- this is the last line
h.close()
Lê Tư Thành