52

I need to extract the last line from a number of very large (several hundred megabyte) text files to get certain data. Currently, I use Python to cycle through all the lines until the file is exhausted and then process the last line returned, but I am certain there is a more efficient way to do this.

What is the best way to retrieve just the last line of a text file using Python?

James Waldby - jwpat7
TimothyAWiseman

11 Answers

55

Not the straightforward way, but probably much faster than a simple Python implementation:

import subprocess

line = subprocess.check_output(['tail', '-1', filename])
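Note that on Python 3, check_output returns bytes, so you usually want to decode the result and strip the trailing newline. A minimal usage sketch (the sample file name is just for illustration):

```python
import subprocess

# Write a small sample file just for demonstration (hypothetical name).
with open('data.txt', 'w') as f:
    f.write("first\nsecond\nlast\n")

# check_output returns bytes on Python 3; decode and strip the newline.
line = subprocess.check_output(['tail', '-1', 'data.txt']).decode().rstrip('\n')
```

This spawns a process per call, so it pays off mainly when the files are large enough that the process-startup cost is negligible.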
sth
46
with open('output.txt', 'r') as f:
    lines = f.read().splitlines()
    last_line = lines[-1]
    print(last_line)
bfontaine
mick barry
13

Use the file's seek method with a negative offset and whence=os.SEEK_END to read a block from the end of the file. Search that block for the last line end character(s) and grab all the characters after it. If there is no line end, back up farther and repeat the process.

import os

def last_line(in_file, block_size=1024, ignore_ending_newline=False):
    suffix = ""
    in_file.seek(0, os.SEEK_END)
    in_file_length = in_file.tell()
    seek_offset = 0

    while -seek_offset < in_file_length:
        # Read from end.
        seek_offset -= block_size
        if -seek_offset > in_file_length:
            # Limit if we ran out of file (can't seek backward from start).
            block_size -= -seek_offset - in_file_length
            if block_size == 0:
                break
            seek_offset = -in_file_length
        in_file.seek(seek_offset, os.SEEK_END)
        buf = in_file.read(block_size)

        # Search for line end.
        if ignore_ending_newline and seek_offset == -block_size and buf[-1] == '\n':
            buf = buf[:-1]
        pos = buf.rfind('\n')
        if pos != -1:
            # Found line end.
            return buf[pos+1:] + suffix

        suffix = buf + suffix

    # One-line file.
    return suffix

Note that this will not work on things that don't support seek, like stdin or sockets. In those cases, you're stuck reading the whole thing (like the tail command does).

Mike DeSimone
    I think this answer only works properly in Python 2. At least, it didn't work for me in Python 3, because you can't seek relative from the end of a text file in Python 3 (throws an io exception). To update this to Python 3: use a binary file, then you have to use byte arrays instead of strings for `buf` (make sure you compare `buf[-1:] == b'\n'`). You can use `suffix.decode('utf-8')` to return a string, if you're sure it's utf-8 encoded. – Multihunter Jan 07 '20 at 02:04
8

If you do know the maximal length of a line, you can do

def getLastLine(fname, maxLineLength=80):
    fp = file(fname, "rb")
    fp.seek(-maxLineLength - 1, 2)  # 2 means "from the end of the file"
    return fp.readlines()[-1]

This works on my Windows machine, but I do not know what happens on other platforms if you open a text file in binary mode. Binary mode is needed if you want to seek relative to the end of the file.

rocksportrocker
    And if you don't know the maximum line length? – Adam Rosenfield Aug 23 '11 at 20:28
    both this and mike's answer are "the right way to do it", but have issues for anything other than simple (single byte, eg ASCII) text encodings. unicode can have multi-byte characters, so in that case (1) you don't know the relative offset in bytes for a given maximum length in characters and (2) you may seek into "the middle" of a character. – andrew cooke Aug 23 '11 at 20:31
  • @Adam, you can usually pick a number that is greater than any reasonable line length even if it isn't a guaranteed maximum. If you absolutely can't make any assumptions or accept a truncated line, you have no choice but to read the whole file. – Mark Ransom Aug 23 '11 at 20:34
    @andrew, the end-of-line byte code in UTF-8 will still be unique even if you start in the middle of a character. That's one of the beauties of UTF-8. – Mark Ransom Aug 23 '11 at 20:35
  • @andrew, Windows' multi-byte line end causes more issues than UTF-8. I'll amend my answer to support it if someone needs it. – Mike DeSimone Aug 23 '11 at 21:14
  • @andrew cooke: Only UTF-16 and UTF-32 can have a CR or LF byte inside a character. They are fixed-length encodings. No designer of a variable-byte-count encoding has been silly enough to have CR or LF byte-values inside a character. – John Machin Aug 23 '11 at 21:27
  • that wasn't exactly my point; what i didn't know was how well utf-8 could synchronize mid-stream. but apparently it's not an issue. which is great. – andrew cooke Aug 24 '11 at 01:10
    @andrew: UTF-8 can sync midstream because the bytes in the representation of a code point >= U+80 all have the high bit set. Therefore, if the high bit is clear, it's a low-ASCII character. This makes us parser writers happy. On the other hand, there are formats such as Shift-JIS, which encode non-low-ASCII characters as two bytes, but only the first byte is guaranteed to have a high bit set. Luckily, they didn't use control characters for the second byte. – Mike DeSimone Aug 24 '11 at 05:02
    @Mike: It's not just that characters below U+80 have the high bit clear -- UTF-8 can resync midstream because there is no overlap between the set of possible *initial* bytes and the set of possible non-initial bytes in all characters. The possible initial bytes all begin with 0 or 11, whereas the possible non-initial bytes begin with 10. – Adam Rosenfield Aug 24 '11 at 21:16
  • @Adam: When I said "the bytes in the representation of a code point >= U+80" I was referring to both the initial bytes and the non-initial bytes. (Plural "bytes" with singular "code point".) Sorry for the confusion. – Mike DeSimone Aug 25 '11 at 03:28
    file() is not supported in Python 3; use open() instead. – Ludo Schmidt Aug 07 '20 at 09:50
7

If you can pick a reasonable maximum line length, you can seek to nearly the end of the file before you start reading.

import os

myfile.seek(-max_line_length, os.SEEK_END)
line = myfile.readlines()[-1]
Mark Ransom
5

Seek to the end of the file minus 100 bytes or so. Do a read and search for a newline. If there is no newline, seek back another 100 bytes or so. Lather, rinse, repeat. Eventually you'll find a newline. The last line begins immediately after that newline.

Best case scenario you only do one read of 100 bytes.
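A minimal sketch of that loop, assuming the file is opened in binary mode (the 100-byte probe size follows the description above; the function name is made up):

```python
import os

def tail_last_line(f, probe=100):
    """Back up `probe` more bytes from the end each pass until a newline appears."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    window = 0
    chunk = b""
    while window < size:
        window = min(window + probe, size)
        f.seek(size - window)
        chunk = f.read(window)
        data = chunk.rstrip(b'\n')   # ignore the file's trailing newline(s)
        nl = data.rfind(b'\n')
        if nl != -1:
            return data[nl + 1:]
        # No newline yet: lather, rinse, repeat with a bigger window.
    return chunk.rstrip(b'\n')       # one-line file
```

In the best case, a single 100-byte read is enough, regardless of file size.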

Bryan Oakley
2

The inefficiency here is not really due to Python, but to the nature of how files are read. The only way to find the last line is to read the file and find the line endings. However, the seek operation can jump to any byte offset in the file. You can therefore begin very close to the end of the file, and grab larger and larger chunks as needed until the last line ending is found:

from os import SEEK_SET, SEEK_END

def get_last_line(file):
  CHUNK_SIZE = 1024 # Would be good to make this the chunk size of the filesystem

  # Find the file size first, so we never seek before the start of the file
  # (seeking to a negative position raises an error).
  file.seek(0, SEEK_END)
  remaining = file.tell()

  last_line = ""

  while True:
    # We grab chunks from the end of the file towards the beginning until we
    # get a new line
    chunk_size = min(CHUNK_SIZE, remaining)
    remaining -= chunk_size
    file.seek(remaining, SEEK_SET)
    chunk = file.read(chunk_size)

    if not chunk:
      # The whole file is one big line
      return last_line

    if not last_line and chunk.endswith('\n'):
      # Ignore the trailing newline at the end of the file (but include it 
      # in the output).
      last_line = '\n'
      chunk = chunk[:-1]

    nl_pos = chunk.rfind('\n')
    # What's being searched for will have to be modified if you are searching
    # files with non-unix line endings.

    last_line = chunk[nl_pos + 1:] + last_line

    if nl_pos == -1:
      # The whole chunk is part of the last line.
      continue

    return last_line
Zack Bloom
1

Here's a slightly different solution. Instead of multi-line, I focused on just the last line, and instead of a constant block size, I have a dynamic (doubling) block size. See comments for more info.

# Get the last line of a text file using the seek method.  Works with a
# non-constant block size.  I don't know if that speeds things up, but it's
# good enough for us, especially with constant line lengths in the file
# (provided by len_guess), in which case the block size doubling is rarely
# performed, if at all.  Currently, we're using this on a text file format
# with constant line lengths.
# Requires that the file is opened in binary mode.  No nonzero end-relative
# seeks are allowed in text mode.
import os

def lastTextFileLine(file, len_guess=1):
    file.seek(-1, os.SEEK_END)       # look at the very last character
    text = file.read(1)
    tot_sz = 1               # total size read so far, counted from the file end
    if text != b'\n':        # if newline is the last character, we want the text right before it
        file.seek(0, os.SEEK_END)    # else, consider the text all the way at the end (after the last newline)
        tot_sz = 0
    blocks = []              # successive search blocks, so we never re-search bytes already seen
    remaining = file.tell() - tot_sz # bytes left in the file ahead of the search window
    block_sz = len_guess
    while remaining > 0:
        if remaining < block_sz:     # don't let block doubling take us past the start of the file
            block_sz = remaining
        tot_sz += block_sz
        remaining -= block_sz
        file.seek(-tot_sz, os.SEEK_END)  # seek() accepts negative offsets for seeking backward from the file end
        text = file.read(block_sz)
        i = text.rfind(b'\n')
        if i != -1:
            blocks.append(text[i + 1:])
            return b''.join(reversed(blocks)).decode()
        blocks.append(text)
        block_sz <<= 1       # double the block size (open-ended binary-search-like strategy)
    return b''.join(reversed(blocks)).decode()   # if no newline was found, return everything read

Ideally, you'd wrap this in a class LastTextFileLine and keep track of a moving average of line lengths; that would give you a good len_guess.

0

Could you load the file into a mmap, then use mmap.rfind(string[, start[, end]]) to find the second-to-last EOL character in the file? A seek to that point in the file should point you to the last line, I would think.
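A sketch of that idea, assuming the file is non-empty (mmap cannot map an empty file); the function name is made up:

```python
import mmap

def last_line_mmap(path):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            if mm[end - 1:end] == b'\n':          # ignore a trailing newline
                end -= 1
            start = mm.rfind(b'\n', 0, end) + 1   # rfind returns -1 if none, so start becomes 0
            return mm[start:end].decode()
```

The OS pages in only the portions of the file actually touched, so this avoids reading the whole file even though it "maps" all of it.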

ChrisC
-3
lines = fileHandle.readlines()
fileHandle.close()
last_line = lines[-1]
Jon Martin
    Gah! Don't ever do `lines[len(lines) -1]`. That's an `O(n)` operation. `lines[-1]` will get the last one. Besides, this isn't any better than the approach he's already using. – g.d.d.c Aug 23 '11 at 20:20
  • Oops, my mistake! This method actually is more efficient though. – Jon Martin Aug 23 '11 at 20:21
  • 11
    @g.d.d.c: `lines[len(lines)-1]` is not O(n) (unless `lines` is a user-defined type with an O(n) implementation of `__len__`, but that's not the case here). While it's bad style, `lines[len(lines)-1]` has a practically identical runtime cost as `lines[-1]`; the only difference is whether the index calculation is done explicitly in script or implicitly by the runtime. – Adam Rosenfield Aug 23 '11 at 20:23
  • This, however, sounds very memory inefficient, as you have to read a possibly large file into memory before performing said `O(1)` operation. – gustafbstrom Oct 19 '18 at 10:37
-6
#!/usr/bin/python

# First pass: count the lines in the file.
count = 0
f = open('last_line1', 'r')
for line in f.readlines():
    line = line.strip()
    count = count + 1
    print(line)
print(count)
f.close()

# Second pass: stop at the line whose number matches the count.
count1 = 0
h = open('last_line1', 'r')
for line in h.readlines():
    line = line.strip()
    count1 = count1 + 1
    if count1 == count:
        print(line)         # -------------------- this is the last line
h.close()
Lê Tư Thành