
I am parsing log files of 1 to 10 GB using Python 3.2. I need to search for lines matching a specific regex (some kind of timestamp), and I want to find the last occurrence.

I have tried to use:

for line in reversed(list(open("filename"))):

which resulted in very bad performance (in the good cases) and MemoryError in the bad cases.

I did not find a good answer in the thread "Read a file in reverse order using python".

The solution in "python head, tail and backward read by lines of a text file" looked very promising; however, it does not work on Python 3.2 and fails with:

NameError: name 'file' is not defined

I later tried replacing File(file) with File(TextIOWrapper), since that is the type of object the built-in open() returns, but that resulted in several more errors (I can elaborate if someone suggests this is the right way :)).

Noam Inbar
  • You may check http://stackoverflow.com/questions/3568833/how-to-read-lines-from-a-file-in-python-starting-from-the-end for several solutions to see if any of them fit your case. – Selcuk Mar 09 '14 at 18:42
  • Take a look at : http://stackoverflow.com/questions/260273/most-efficient-way-to-search-the-last-x-lines-of-a-file-in-python/260433#260433 – Loïc G. Mar 09 '14 at 18:44
  • @LoïcG. Thanks, this solution also seems promising and I had already checked it. However, it also expects `file` and not TextIOWrapper, and passing it open('some_file.txt') did not work either. – Noam Inbar Mar 09 '14 at 18:56
  • Please post your code. You may use `with open(...) as file: ...` – Loïc G. Mar 09 '14 at 18:57

2 Answers


If you don't want to read the whole file you can always use seek. Here is a demo:

 $ cat words.txt 
foo
bar
baz
[6] oz123b@debian:~ $ ls -l words.txt 
-rw-r--r-- 1 oz123 oz123 12 Mar  9 19:38 words.txt

The file size is 12 bytes. You can skip to the last entry by moving the cursor 8 bytes forward:

In [3]: w=open("words.txt")
In [4]: w.seek(8)
In [5]: w.readline()
Out[5]: 'baz\n'

To complete my answer, here is how you print these lines in reverse:

 w=open('words.txt')

In [6]: for s in [8, 4, 0]:
   ...:     _= w.seek(s)
   ...:     print(w.readline().strip())
   ...:     
baz
bar
foo

You will have to explore your file's data structure and the size of each line. Mine was quite simple, because it was only meant to demonstrate the principle.
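
For lines of varying length, the offsets `[8, 4, 0]` used above have to be computed rather than hard-coded. Here is a minimal sketch of one way to do that; the helper names `line_offsets` and `lines_reversed_by_offset` are made up for this illustration. It assumes one forward pass over the file is acceptable and that a list with one integer per line fits in memory, which may not hold for a 10 GB log.

```python
def line_offsets(filename):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    # Binary mode, so len(line) is the exact number of bytes on disk.
    with open(filename, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def lines_reversed_by_offset(filename):
    """Yield lines (as bytes, without the trailing newline), last line first."""
    offsets = line_offsets(filename)
    with open(filename, "rb") as f:
        for start in reversed(offsets):
            f.seek(start)
            yield f.readline().rstrip(b"\n")
```

For the `words.txt` file above, `line_offsets` would give the same three positions that were typed by hand in the demo.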

oz123

This is a function that does what you're looking for

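def reverse_lines(filename, BUFSIZE=4096):
    # Binary mode keeps seek()/tell() working on plain byte offsets.
    # Lines are decoded when yielded; UTF-8 is an assumption, adjust as needed.
    with open(filename, "rb") as f:
        f.seek(0, 2)  # jump to the end of the file
        p = f.tell()
        remainder = b""
        while True:
            sz = min(BUFSIZE, p)
            p -= sz
            f.seek(p)
            buf = f.read(sz) + remainder
            if b"\n" not in buf:
                # No complete line yet, keep accumulating
                remainder = buf
            else:
                i = buf.index(b"\n")
                # Everything after the first newline consists of complete lines
                for L in buf[i+1:].split(b"\n")[::-1]:
                    yield L.decode("utf-8")
                remainder = buf[:i]
            if p == 0:
                break
        yield remainder.decode("utf-8")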

It works by reading a buffer from the end of the file (4 KB by default) and yielding the lines in it in reverse order. It then moves back by another 4 KB and repeats until it reaches the beginning of the file. The code may need to keep more than 4 KB in memory if the section being processed contains no linefeed (very long lines). Note that if the file ends with a newline, the first value yielded is an empty string.

You can use the code as

for L in reverse_lines("my_big_file"):
   ... process L ...
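
To tie this back to the question (finding the last line that matches a regex), here is a minimal self-contained sketch of the same backward-reading idea that stops at the first match found from the end. The names `iter_lines_reversed` and `find_last_match` are hypothetical, and the timestamp pattern in the usage note is only an example, not the asker's actual format. It works on bytes throughout, so the pattern must be a bytes pattern.

```python
import re


def iter_lines_reversed(filename, bufsize=4096):
    """Yield the lines of a file as bytes (no newline), last line first."""
    with open(filename, "rb") as f:
        f.seek(0, 2)              # move to the end of the file
        pos = f.tell()
        remainder = b""           # tail of a line whose start is not read yet
        while pos > 0:
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos)
            buf = f.read(size) + remainder
            first_nl = buf.find(b"\n")
            if first_nl == -1:
                remainder = buf   # still inside one long line
                continue
            # Everything after the first newline is made of complete lines.
            for line in reversed(buf[first_nl + 1:].split(b"\n")):
                yield line
            remainder = buf[:first_nl]
        yield remainder           # the first line of the file


def find_last_match(filename, pattern):
    """Return the last line matching the bytes regex `pattern`, or None."""
    regex = re.compile(pattern)
    for line in iter_lines_reversed(filename):
        if regex.search(line):
            return line
    return None
```

A call might look like `find_last_match("my_big_file", br"^\d{4}-\d{2}-\d{2}")`. Because the generator is abandoned as soon as a match is found, only the tail of the file up to that line is read.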
6502
  • thanks! this functions indeed does what i was looking for. Small issue though: using "rb" for open led to an error: `File "/home/noami/utils/doawasup.py", line 20, in reverse_lines buf = f.read(sz) + remainder TypeError: can't concat bytes to str` but after removing the `b`inary it worked perfect. – Noam Inbar Mar 09 '14 at 23:42