32

I have a file and I don't know how big it's going to be (it could be quite large, but the size will vary greatly). I want to search the last 10 lines or so to see if any of them match a string. I need to do this as quickly and efficiently as possible and was wondering if there's anything better than:

s = "foo"
last_bit = fileObj.readlines()[-10:]
for line in last_bit:
    if line == s:
        print "FOUND"
martineau
Harley Holcombe

19 Answers

37
# Tail
from __future__ import with_statement

find_str = "FIREFOX"                    # String to find
fname = "g:/autoIt/ActiveWin.log_2"     # File to check

with open(fname, "r") as f:
    f.seek(0, 2)                        # Seek to EOF
    fsize = f.tell()                    # Get size
    f.seek(max(fsize - 1024, 0), 0)     # Set pos to last n chars
    lines = f.readlines()               # Read to end

lines = lines[-10:]    # Get last 10 lines

# This returns True if any line is exactly find_str + "\n"
print find_str + "\n" in lines

# If you're searching for a substring
for line in lines:
    if find_str in line:
        print True
        break
jfs
PabloG
  • The "if len(l) < 10" is redundant. "print l[:-10]" handles that case. – Darius Bacon Nov 04 '08 at 00:35
  • @Darius: I really meant if len(l) > 10, fixed – PabloG Nov 04 '08 at 01:16
  • lines[:-10] drops the last 10 lines. What you want is lines[-10:]. – Markus Jarderot Nov 04 '08 at 01:20
  • @MizardX / @ΤΖΩΤΖΙΟΥ: you're right, of course. Thx for the bugfix/comment – PabloG Nov 04 '08 at 11:31
  • Replace `if l.find(find_str):` with `if l.find(find_str) >= 0:` or better yet by `if find_str in l:`. – jfs Sep 24 '09 at 21:47
  • `if len(lines) > 10:` is redundant. `list("ab")[-10:] -> ['a', 'b']` – jfs Sep 24 '09 at 21:49
  • I've replaced `if l.find(find_str)` by `if find_str in line` – jfs Sep 25 '09 at 18:53
  • I've removed redundant `if len(lines) > 10` – jfs Sep 25 '09 at 18:56
  • This will fail if the file has very long lines. The code assumes that the last 10 lines falls within the last 1k of data. There should be a check that there are at least 11 lines, or continue to seek backwards until that condition is true. – Bryan Oakley Mar 13 '10 at 13:09
35

Here's an answer like MizardX's, but without its apparent problem of taking quadratic time in the worst case from rescanning the working string repeatedly for newlines as chunks are added.

Compared to the Active State solution (which also seems to be quadratic), this doesn't blow up given an empty file and does one seek per block read instead of two.

Compared to spawning 'tail', this is self-contained. (But 'tail' is best if you have it.)

Compared to grabbing a few kB off the end and hoping it's enough, this works for any line length.

import os

def reversed_lines(file):
    "Generate the lines of file in reverse order."
    part = ''
    for block in reversed_blocks(file):
        for c in reversed(block):
            if c == '\n' and part:
                yield part[::-1]
                part = ''
            part += c
    if part: yield part[::-1]

def reversed_blocks(file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."
    file.seek(0, os.SEEK_END)
    here = file.tell()
    while 0 < here:
        delta = min(blocksize, here)
        here -= delta
        file.seek(here, os.SEEK_SET)
        yield file.read(delta)

To use it as requested:

from itertools import islice

def check_last_10_lines(file, key):
    for line in islice(reversed_lines(file), 10):
        if line.rstrip('\n') == key:
            print 'FOUND'
            break

Edit: changed map() to itertools.imap() in head(). Edit 2: simplified reversed_blocks(). Edit 3: avoid rescanning tail for newlines. Edit 4: rewrote reversed_lines() because str.splitlines() ignores a final '\n', as BrianB noticed (thanks).

Note that in very old Python versions the string concatenation in a loop here will take quadratic time. CPython from at least the last few years avoids this problem automatically.
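On interpreters that don't optimize repeated string concatenation, the same backward traversal can accumulate the current line in a `bytearray` and join it only when a line is emitted, keeping the total cost linear. A Python 3 sketch of that idea (not the answer's original code; it takes a path and opens the file in binary mode, since Python 3 text streams don't support seeking to arbitrary computed offsets; `reversed_lines_linear` is an illustrative name):

```python
import os

def reversed_lines_linear(path, blocksize=4096):
    """Yield the lines of the file at `path` in reverse order.

    The current line is collected as bytes in reverse order and joined
    only when it is emitted, so the cost stays linear even without
    CPython's in-place concatenation optimization.
    """
    with open(path, 'rb') as f:
        part = bytearray()              # current line, bytes in reverse order
        f.seek(0, os.SEEK_END)
        here = f.tell()
        while here > 0:
            delta = min(blocksize, here)
            here -= delta
            f.seek(here, os.SEEK_SET)
            block = f.read(delta)
            for c in reversed(block):   # walk the block back to front
                if c == ord('\n') and part:
                    yield bytes(reversed(part)).decode()
                    part = bytearray([c])
                else:
                    part.append(c)
        if part:
            yield bytes(reversed(part)).decode()
```

As in the answer above, each yielded line carries its trailing newline except possibly the first line of the file.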

Aman
Darius Bacon
  • Very nice -- I read down the list of answers until I got here, knowing that the best one would be whichever was savvy enough to use the `yield` directive – Brian B Dec 04 '12 at 23:08
  • Fixed a corner case for ya -- sometimes a block ends in a newline, so the tail is its own entry. – Brian B Dec 05 '12 at 21:12
  • @BrianB, thanks -- can you give a test case where my code breaks? I've reverted your change because it failed on the first thing I tried, '\nhello\n\nworld\n' (with blocksize set to 2). (My thanks are not ironic because I expect you noticed a real case where my code failed instead.) – Darius Bacon Dec 09 '12 at 19:45
  • @BrianB, I think I see what you saw, and it seems nicest to rewrite the whole function, alas. Done. – Darius Bacon Dec 09 '12 at 20:49
  • Works, but it is about half as fast as the previous version (as corrected). – Brian B Dec 13 '12 at 15:02
  • Do Jython, Pypy avoid the quadratic runtime for `part += c`? – jfs Apr 13 '14 at 13:24
  • Probably not, though I haven't tried. It'd be nice to see a clean/fast version without that issue; obviously we could change `part` to a list of characters and join it on the yields, but can we do better? – Darius Bacon Apr 14 '14 at 19:50
  • PyPy does not avoid quadratic time. – Veedrac Sep 28 '14 at 05:36
  • Splitting the lines by '\n' yields a 30% faster implementation: `def reversed_lines(self): part = ''; for block in self.reversed_blocks(): block = block + part; block = block.split('\n'); block.reverse(); part = block.pop(); if block[0] == '': block.pop(0); for line in block: yield line + '\n'; if part: yield part`. I will post an io.IOBase wrapper for anyone below. – asterio gonzalez Aug 08 '18 at 15:46
8

I think reading the last 2 KB or so of the file should make sure you get 10 lines, and shouldn't be too much of a resource hog.

file_handle = open("somefile")
file_handle.seek(0, 2)  # seek to the end; tell() on a freshly opened file is 0
file_size = file_handle.tell()
file_handle.seek(max(file_size - 2*1024, 0))

# this will get rid of trailing newlines, unlike readlines()
last_10 = file_handle.read().splitlines()[-10:]

assert len(last_10) == 10, "Only read %d lines" % len(last_10)
Ryan Ginstrom
8

If you are running Python on a POSIX system, you can use 'tail -10' to retrieve the last few lines. This may be faster than writing your own Python code to get the last 10 lines. Rather than opening the file directly, open a pipe from the command 'tail -10 filename'. If you are certain of the log output though (for example, you know that there are never any very long lines that are hundreds or thousands of characters long) then using one of the 'read the last 2KB' approaches listed would be fine.
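With modern Python the pipe can be opened via the subprocess module rather than os.popen; a small sketch (the function names are illustrative, not from the answer, and a POSIX `tail` on the PATH is assumed):

```python
import subprocess

def tail_lines(path, n=10):
    """Return the last n lines of `path` by piping from the POSIX `tail` command."""
    result = subprocess.run(
        ["tail", "-n", str(n), path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def found_in_tail(path, s, n=10):
    """True if any of the last n lines of `path` equals s exactly."""
    return s in tail_lines(path, n)
```

Because of `check=True`, `subprocess.run` raises `CalledProcessError` if tail fails, e.g. when the file doesn't exist.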

Myrddin Emrys
  • I would be cautious with this, because shell calls have much more overhead than a direct access. – Svante Nov 04 '08 at 05:35
  • This is quite old, but I was not actually advocating a shell call. I was recommending calling the script with piped output from tail, rather than calling the script to read the whole file itself. – Myrddin Emrys Mar 22 '10 at 22:06
5

Here is a version using mmap that seems pretty efficient. The big plus is that mmap will automatically handle the file-to-memory paging for you.

import os
from mmap import mmap

def lastn(filename, n):
    # open the file and mmap it
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(f.name))

    nlcount = 0
    i = m.size() - 1 
    if m[i] == '\n': n += 1
    while nlcount < n and i > 0:
        if m[i] == '\n': nlcount += 1
        i -= 1
    if i > 0: i += 2

    return m[i:].splitlines()

target = "target string"
print [l for l in lastn('somefile', 10) if l == target]
mhawke
  • Nice! I should've thought of mmap. This goes an order of magnitude slower than mine on my test of a really big 1-line file, though, I guess because it checks char-by-char in Python code. – Darius Bacon Nov 04 '08 at 05:57
  • Yes, I was worried about the "pure Python" loop too. The loop could possibly be made more efficient than the code I've provided. If the mmap object had an rfind() method, it could have been much better! – mhawke Nov 04 '08 at 23:00
  • FYI: Python v2.6.5's mmap objects have an `rfind()` method. – RobM Feb 10 '11 at 17:41
  • `tail -r` also uses `mmap` for regular files (see [`r_reg()` function](http://svnweb.freebsd.org/base/head/usr.bin/tail/reverse.c?revision=216370&view=markup#104)) – jfs Nov 06 '13 at 05:45
2

I took mhawke's suggestion to use mmap and wrote a version that uses rfind:

from mmap import mmap
import sys

def reverse_file(f):
    mm = mmap(f.fileno(), 0)
    nl = mm.size() - 1
    prev_nl = mm.size()
    while nl > -1:
        nl = mm.rfind('\n', 0, nl)
        yield mm[nl + 1:prev_nl]
        prev_nl = nl + 1

def main():
    # Example usage
    with open('test.txt', 'r+') as infile:
        for line in reverse_file(infile):
            sys.stdout.write(line)

if __name__ == '__main__':
    main()
Edd
2

I think I remember adapting the code from this blog post from Manu Garg when I had to do something similar.

Daryl Spitzer
2

If you're on a unix box, os.popen("tail -10 " + filepath).readlines() will probably be the fastest way. Otherwise, it depends on how robust you want it to be. The methods proposed so far will all fall down, one way or another. For robustness and speed in the most common case you probably want something like a logarithmic search: use file.seek to go to the end of the file minus 1000 characters, read that in, and check how many lines it contains; then seek to EOF minus 3000 characters, read in 2000 characters and count the lines; then EOF minus 7000, read in 4000 characters and count the lines; and so on, until you have as many lines as you need. But if you know for sure that it's always going to be run on files with sensible line lengths, you may not need that.

You might also find some inspiration in the source code for the unix tail command.
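That doubling search might be sketched like this (a Python 3 sketch in binary mode, not code from the answer; `last_n_lines` is a hypothetical name). It reads until the buffer holds more than n newlines, which guarantees n complete lines:

```python
import os

def last_n_lines(path, n=10, start=1024):
    """Return the last n lines of the file at `path`.

    Doubles the amount read from the end until the buffer contains more
    than n newlines (so at least n complete lines), or until the whole
    file has been read.
    """
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        amount = start
        while True:
            amount = min(amount, size)
            f.seek(size - amount)
            data = f.read(amount)
            if data.count(b'\n') > n or amount == size:
                break
            amount *= 2
        return [line.decode() for line in data.splitlines()[-n:]]
```

Doubling the window means only O(log(bytes read)) seeks, and the total data read is at most about twice what was needed.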

Alex Coventry
2

I ran into that problem, parsing the last hour of LARGE syslog files, and used this function from activestate's recipe site... (http://code.activestate.com/recipes/439045/)

#!/usr/bin/env python
# -*-mode: python; coding: iso-8859-1 -*-
#
# Copyright (c) Peter Astrand <astrand@cendio.se>

import os
import string

class BackwardsReader:
    """Read a file line by line, backwards"""
    BLKSIZE = 4096

    def readline(self):
        while 1:
            newline_pos = string.rfind(self.buf, "\n")
            pos = self.file.tell()
            if newline_pos != -1:
                # Found a newline
                line = self.buf[newline_pos+1:]
                self.buf = self.buf[:newline_pos]
                if pos != 0 or newline_pos != 0 or self.trailing_newline:
                    line += "\n"
                return line
            else:
                if pos == 0:
                    # Start-of-file
                    return ""
                else:
                    # Need to fill buffer
                    toread = min(self.BLKSIZE, pos)
                    self.file.seek(-toread, 1)
                    self.buf = self.file.read(toread) + self.buf
                    self.file.seek(-toread, 1)
                    if pos - toread == 0:
                        self.buf = "\n" + self.buf

    def __init__(self, file):
        self.file = file
        self.buf = ""
        self.file.seek(-1, 2)
        self.trailing_newline = 0
        lastchar = self.file.read(1)
        if lastchar == "\n":
            self.trailing_newline = 1
            self.file.seek(-1, 2)

# Example usage
br = BackwardsReader(open('bar'))

while 1:
    line = br.readline()
    if not line:
        break
    print repr(line)

It works really well and is much more efficient than anything like fileObj.readlines()[-10:], which makes Python read the entire file into memory and then chop the last ten lines off of it.

user32716
2

Thanks to Darius Bacon's solution above, but with a 30% faster implementation and wrapped in an io.IOBase subclass.

import io
import os

class ReverseFile(io.IOBase):
    def __init__ (self, filename, headers=1):
        self.fp = open(filename)
        self.headers = headers
        self.reverse = self.reversed_lines()
        self.end_position = -1
        self.current_position = -1

    def readline(self, size=-1):
        if self.headers > 0:
            self.headers -= 1
            raw = self.fp.readline(size)
            self.end_position = self.fp.tell()
            return raw

        raw = next(self.reverse)
        if self.current_position > self.end_position:
            return raw

        raise StopIteration

    def reversed_lines(self):
        """Generate the lines of file in reverse order.
        """
        part = ''
        for block in self.reversed_blocks():
            block = block + part
            block = block.split('\n')
            block.reverse()
            part = block.pop()
            if block[0] == '':
                block.pop(0)

            for line in block:
                yield line + '\n'

        if part:
            yield part

    def reversed_blocks(self, blocksize=0xFFFF):
        "Generate blocks of file's contents in reverse order."
        file = self.fp
        file.seek(0, os.SEEK_END)
        here = file.tell()
        while 0 < here:
            delta = min(blocksize, here)
            here -= delta
            file.seek(here, os.SEEK_SET)
            self.current_position = file.tell()
            yield file.read(delta)

An example

rev = ReverseFile(filename)
for i, line in enumerate(rev):
    print("{0}: {1}".format(i, line.strip()))
asterio gonzalez
1

You could read chunks of 1,000 bytes or so from the end of the file into a buffer until you have 10 lines.
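That chunked backward read might look like this (a Python 3 sketch, not code from the answer; `tail_chunks` is an illustrative name):

```python
import os

def tail_chunks(path, n=10, chunksize=1000):
    """Read fixed-size chunks backwards from the end of `path` until the
    buffer contains more than n newlines (i.e. at least n complete lines),
    then return the last n lines."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b''
        while pos > 0 and buf.count(b'\n') <= n:
            step = min(chunksize, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf   # prepend the chunk to the buffer
        return [line.decode() for line in buf.splitlines()[-n:]]
```

Note the prepend makes this quadratic in the number of chunks read; that only matters when the last n lines span many chunks.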

Robert Gamble
1

You could also count the lines as you reverse through the file, instead of guessing at a byte offset.

NUM_OF_LINES = 10
lines = 0
chunk_size = 1024

f = open('filename')
f.seek(0, 2)
f.seek(max(f.tell() - chunk_size, 0))

while True:
    s = f.read(chunk_size)
    lines += s.count('\n')
    if lines > NUM_OF_LINES or f.tell() <= chunk_size:
        break
    f.seek(max(f.tell() - chunk_size*2, 0))

Now the file is at a good position to run readlines(). You also could cache the strings you read the first time, to eliminate reading the same portion of the file twice.

SilentGhost
JimB
0

Maybe this might be useful:

import os

path = 'path_to_file'
os.system('tail -n10 ' + path)
sth
AM01
0

Read the last few KB of the file, and split that into lines to return only the last 10.

It's quite unlikely that the start of that chunk falls exactly on a line boundary, but you'll discard the first lines anyway.

Javier
0

Personally I'd be tempted to break out to the shell and call tail -n10 to load the file. But then I'm not really a Python programmer ;)

Gareth
0

First, a function that returns a list:

def lastNLines(file, N=10, chunksize=1024):
    lines = None
    file.seek(0,2) # go to eof
    size = file.tell()
    pos = 0
    for pos in xrange(chunksize,size-1,chunksize):
        # read a chunk, counting back from the end of the file
        file.seek(-pos,2)
        chunk = file.read(chunksize)
        if lines is None:
            # first time
            lines = chunk.splitlines()
        else:
            # other times, update the 'first' line with
            # the new data, and re-split
            lines[0:1] = (chunk + lines[0]).splitlines()
        if len(lines) > N:
            return lines[-N:]
    # prepend whatever is left at the start of the file
    file.seek(0)
    chunk = file.read(size-pos)
    if lines is None:
        lines = chunk.splitlines()
    else:
        lines[0:1] = (chunk + lines[0]).splitlines()
    return lines[-N:]

Second, a function that iterates over the lines in reverse order:

def iter_lines_reversed(file, chunksize=1024):
    file.seek(0,2)
    size = file.tell()
    last_line = ""
    pos = 0
    for pos in xrange(chunksize,size-1,chunksize):
        # read a chunk, counting back from the end of the file
        file.seek(-pos,2)
        chunk = file.read(chunksize) + last_line
        # split into lines
        lines = chunk.splitlines()
        # the first entry may be a partial line; carry it to the next chunk
        last_line = lines[0]
        # iterate in reverse order, skipping the partial first line
        for line in reversed(lines[1:]):
            yield line
    # handle the remaining data at the beginning of the file
    file.seek(0)
    chunk = file.read(size-pos) + last_line
    lines = chunk.splitlines()
    for line in reversed(lines):
        yield line

For your example:

s = "foo"
for index, line in enumerate(iter_lines_reversed(fileObj)):
    if line == s:
        print "FOUND"
        break
    elif index+1 >= 10:
        break

Edit: Now gets the file-size automaticly
Edit2: Now only iterates for 10 lines.

Markus Jarderot
0

This solution will read the file only once, but using 2 file object pointers to be able obtain the last N lines of file without re-reading it:

def getLastLines (path, n):
    # return the last n lines from the file indicated in path

    fp = open(path)
    for i in range(n):
        line = fp.readline()
        if line == '':
            break   # fewer than n lines; the whole file will be returned

    back = open(path)
    for each in fp:
        back.readline()

    result = []
    for line in back:
        result.append(line.rstrip('\n'))

    return result

s = "foo"
last_bit = getLastLines(r'C:\Documents and Settings\ricardo.m.reyes\My Documents\desarrollo\tail.py', 10)
for line in last_bit:
    if line == s:
        print "FOUND"
Ricardo Reyes
0

A quick and dirty solution to do the task stated in the title of this question:

"foo" in deque(f, 10)

Checks whether any of the last ten lines of f is "foo". The file is read front to back, the file iterator splits it into lines, and the deque only ever keeps the last 10 lines in memory. (Note that iterating a file keeps each line's trailing newline, so the exact match as written only hits a final line without one; compare against "foo\n", or rstrip each line, for the general case.)

A decent solution for UTF-8 / UTF-16 encoded files that are already opened in text mode, where seeking to an arbitrary byte offset with f.seek could land in the middle of a character, or if you're just looking for a convenient one-liner.

Quote from docs.python.org:

If maxlen is not specified or is None, deques may grow to an arbitrary length. Otherwise, the deque is bounded to the specified maximum length. Once a bounded length deque is full, when new items are added, a corresponding number of items are discarded from the opposite end.

Wrapped as a function

from collections import deque

def tail(path: str, n: int = 10, mode: str = "r") -> deque:
    """
    Reads the file at `path` in file mode `mode` and returns the last `n` lines of that file as a `deque`.
    """
    with open(path, mode) as f:
        return deque(f, n)

"foo" in tail("/path/to/file", 10)
Trasp
-1

This will return the last 10 lines as a list then you can search your line easily. (Python 3 compatible)

def read_last_n_lines_new(lines_need=10):

    with open('Log.txt', 'rb') as f:
        f.seek(0, 2)
        data = []
        lines_found = 0
        while True:
            try:
                f.seek(-1, 1)
            except OSError:
                # reached the beginning of the file
                break
            finally:
                c = f.read(1)
                f.seek(-1, 1)
            if c == b'\n':
                lines_found = lines_found + 1
            if lines_found > lines_need or not c:
                break
            data.insert(0, c.decode('utf-8'))

        lines = []
        cur = ""
        for l in data:
            if l == '\n':
                lines.append(cur)
                cur = ''
            else:
                cur = cur + l
        if cur:
            # keep a final line that has no trailing newline
            lines.append(cur)
        return lines

Arpit
  • there is already a lot of answers to this question, is this one useful and/or better than the others ? – MarcMush Sep 29 '21 at 13:25
  • @MarcMush Most of the answers read the file in chunks. I have read the file 1 byte at a time and the function stops exactly when the last n lines are read so I guess it should be fast. – Arpit Sep 29 '21 at 17:12