1

I have a log file that has data lines and some explanation text lines. I would like to read the last 10 data lines from the file. How can I do it in Python? I mean, is there a faster way than using

for line in reversed(open("filename").readlines()):

and then parsing the file? I guess it reads the whole file and will be slow if the log file is huge. So is there a method to open just the end of the file and read data from it? All I need is the last 10 lines that contain the text ,Kes. If there are fewer than 10 lines containing ,Kes, it should return all lines containing ,Kes, in the same order they appeared in the file.

Jaakko Seppälä

4 Answers

2

You have to pass over the first (N - 10) lines anyway, but you can do it in a smart way: the fact that you're spending time doesn't mean you have to spend memory as well. In your code you use readlines(), which reads all the lines and returns them as a list. However, the file object itself is an iterator, so you can feed it into a container with a restricted length, which will only preserve the last N lines inserted into it. In Python you can use a deque with its maxlen set to 10 for this purpose:

from collections import deque

with open("filename") as f:
    last_ten_lines = deque(f, maxlen=10)

Regarding your last point: a regular file object does not support reversed(), so instead of reading from the back, filter the lines as you iterate forward and feed the matches into the same bounded deque:

from collections import deque

def get_last_n(file_name, n=10):
    """Return the last n lines containing ',Kes', in file order."""
    with open(file_name) as f:
        matching = (line for line in f if ",Kes" in line)
        return deque(matching, maxlen=n)
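
For example, reusing the question's file name (hypothetical):

print(list(get_last_n("filename")))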
Mazdak
  • @ThomWiggers Not at once. – Mazdak Mar 22 '18 at 07:29
  • I just checked the source code: it will [consume the iterable](https://github.com/python/cpython/blob/master/Modules/_collectionsmodule.c#L470-L494). Same in [PyPy](https://bitbucket.org/pypy/pypy/src/default/lib_pypy/_collections.py?fileviewer=file-view-default#_collections.py-48:49). – Thom Wiggers Mar 22 '18 at 07:30
  • @OP: this answer is equivalent to the code you are writing: it will read all the lines of the file. – Thom Wiggers Mar 22 '18 at 07:33
  • @ThomWiggers it is not equivalent. The OP's solution reads all lines into a list, reverses the whole list and then starts to parse it. An iterator only ever touches the next line, so it takes far less RAM (by orders of magnitude) than the OP's. – Patrick Artner Mar 22 '18 at 07:47
  • 1
    That's better than mine :) pity I could only +1 the -1 away. – Patrick Artner Mar 22 '18 at 07:55
1

You can

  • read everything, store all lines in a list, reverse the list and take the first 10 lines that contain ,Kes
    • your approach - takes lots of storage and time
  • use Kasramvd's approach, which is frankly far more elegant than this one - leveraging the file iterator and a bounded deque
  • read each line yourself and check whether ,Kes is in it; if so, queue it:

from collections import deque

# create demo data
with open("filename", "w") as f:
    for n in range(20):
        for p in range(20):
            f.write("some line {}-{}\n".format(n, p))

        f.write("some line with {} ,Kes \n".format(n))

# read demo data
q = deque(maxlen=10)
with open("filename") as f:
    for line in f:           # read one line at a time, not the whole file at once
        if ',Kes' in line:   # only store the line if ,Kes is in it
            q.append(line)   # the maxlen limit ensures at most 10 are kept

# print the "remembered" data
print(list(q))

Output:

['some line with 10 ,Kes \n', 'some line with 11 ,Kes \n', 'some line with 12 ,Kes \n', 
 'some line with 13 ,Kes \n', 'some line with 14 ,Kes \n', 'some line with 15 ,Kes \n', 
 'some line with 16 ,Kes \n', 'some line with 17 ,Kes \n', 'some line with 18 ,Kes \n', 
 'some line with 19 ,Kes \n']

You will not have the whole file in RAM at once either: at most 11 lines (the current line plus the deque holding up to 10), and the deque only remembers lines with ,Kes in them.

Patrick Artner
1

Your proposed code is clearly not efficient:

  • you read the whole file into memory
  • you fully reverse the list of lines
  • only then do you search for the lines containing the keyword.

I can imagine 2 possible algorithms:

  1. scan the file in forward order and store the last 10 lines containing the keyword, each new one replacing the oldest. Code could be more or less:

    to_keep = [None] * 10
    index = 0
    for line in file:
        if keyword in line:
            to_keep[index] = line
            index = (index + 1) % 10

    # restore file order: oldest kept line first, skipping unused slots
    last_lines = [l for l in to_keep[index:] + to_keep[:index] if l is not None]
    

    It should be acceptable if only a few lines in the file contain the keyword, and if reading from the back would anyway require loading a large part of the file.

  2. Read the file in chunks from the end and apply the above algorithm on each chunk. It will be more efficient if the keyword is frequent enough that only a few chunks are required, but it is slightly more complex: you cannot seek to lines, only to byte positions in a file, so a chunk could start in the middle of a line or even in the middle of a multibyte character (think about UTF-8). You should therefore keep the first partial line of each chunk and prepend it to the next (earlier) chunk; see the sketch below.
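
A minimal sketch of algorithm 2; the helper name tail_matching and the chunk_size default are mine, not from the answer. Reading in binary and decoding only complete lines sidesteps the multibyte issue, since a newline byte never occurs inside a UTF-8 multibyte sequence:

import os

def tail_matching(file_name, keyword, n=10, chunk_size=8192):
    """Sketch: collect the last n lines containing keyword by scanning
    fixed-size chunks backwards from the end of the file."""
    kw = keyword.encode()
    matches = []                  # matching lines, newest first
    with open(file_name, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        partial = b""             # line fragment cut off by a chunk boundary
        while pos > 0 and len(matches) < n:
            read_size = min(chunk_size, pos)
            pos -= read_size
            f.seek(pos)
            chunk = f.read(read_size) + partial
            lines = chunk.split(b"\n")
            if pos > 0:
                # the first piece may be an incomplete line; carry it over
                partial = lines.pop(0)
            for line in reversed(lines):
                if kw in line and len(matches) < n:
                    matches.append(line.decode())
    return list(reversed(matches))   # restore file order

tail_matching("filename", ",Kes") then returns at most the last 10 matching lines, oldest first, as the question asks.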

Serge Ballesta
-1

import os
os.popen('tail -n 10 filepath').read()
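
As written, this relies on the Unix tail command and ignores the ,Kes filter from the question. A hedged variant that filters first, assuming a Unix shell with grep available:

import os
os.popen("grep ',Kes' filepath | tail -n 10").read()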