
I'd like to get the last line from a big gzipped log file, without having to iterate on all other lines, because it's a big file.

I have read Print Last Line of File Read In with Python and in particular this answer for big files, but it does not work efficiently for gzipped files. Indeed, I tried:

import gzip, os

with gzip.open(f, 'rb') as g:
    g.seek(-2, os.SEEK_END)
    while g.read(1) != b'\n':  # keep reading backward until a line break is found
        g.seek(-2, os.SEEK_CUR)
    print(g.readline().decode())

but it already takes more than 80 seconds for a 10 MB compressed / 130 MB decompressed file, on my very standard laptop!

Question: how to seek efficiently to the last line in a gzipped file, with Python?


Side remark: when the file is not gzipped, this method is very fast: 1 millisecond for a 130 MB file:

import os, time
t0 = time.time()
with open('test', 'rb') as g:
    g.seek(-2, os.SEEK_END) 
    while g.read(1) != b'\n': 
        g.seek(-2, os.SEEK_CUR) 
    print(g.readline().decode())
print(time.time() - t0)    
Basj

2 Answers


If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
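For reference, that unavoidable O(n) pass can at least be written compactly and with O(1) memory; this is just a sketch of the baseline (`last_line` is a hypothetical helper, not from the answer), using `collections.deque` with `maxlen=1` to keep only the most recently read line:

```python
import gzip
from collections import deque

def last_line(path):
    # Stream through the decompressed data; the deque retains only the
    # most recent line, so memory stays O(1) while time is O(n).
    with gzip.open(path, 'rt') as g:
        return deque(g, maxlen=1)[0]
```
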

If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.
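One minimal way to realize this suggestion (my sketch; `write_with_index` and `read_last_line` are hypothetical helpers, not library functions) exploits the fact that a gzip file may consist of several concatenated members. If the writer emits the final line as its own member and records that member's byte offset, a reader can seek straight to it and decompress only the last line:

```python
import gzip

def write_with_index(path, lines):
    # Write all but the last line as one gzip member, then the last
    # line as a second member; return the offset of the second member.
    with open(path, 'wb') as f:
        f.write(gzip.compress(''.join(lines[:-1]).encode()))
        offset = f.tell()          # start of the final member
        f.write(gzip.compress(lines[-1].encode()))
    return offset                  # store this, e.g. in a sidecar file

def read_last_line(path, offset):
    # Seek directly to the final member and decompress only it.
    with open(path, 'rb') as f:
        f.seek(offset)
        return gzip.decompress(f.read()).decode()
```

The resulting file is still a valid gzip file: ordinary readers (`gzip.open`, `zcat`) see the concatenated members as one stream. A fully general scheme would record entry points throughout the file, along the lines of `zran.c` in zlib's examples.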

Mark Adler
  • The API of the `gzip` module probably doesn't support this, but theoretically would the `gzip` algorithm support reading the bytes one by one, *backwards*, from the end? If so, the time to read the *last* line should be equivalent to the time to read the first line, is that correct? – Basj Sep 09 '21 at 18:39
  • Thanks for your answer @MarkAdler. I did not know I was speaking with the creator of GNU gzip and zlib, my respects! – Basj Sep 10 '21 at 07:17

The slowness is probably due to the many calls to seek in the loop: each backward seek on a GzipFile forces it to rewind and re-decompress from the beginning of the file.

So this solution with only one seek works:

import gzip, os

with gzip.open(f, 'rb') as g:
    g.seek(-1000, os.SEEK_END)          # go 1000 (uncompressed) bytes before the end
    last = g.readlines()[-1].decode()   # the last line

Note that:

  • g.readlines() is fast here, because it only splits the last 1000 bytes into lines
  • change 1000 according to the longest line that could occur in your files

Still looking for a better solution. This question is related but does not give a real solution for getting the last line: Lazy Method for Reading Big File in Python?
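The fixed 1000-byte window fails when the last line is longer than that. A more defensive variant (my sketch, not the answer's original code; `last_line_backward` is a hypothetical helper) doubles the window until it contains a newline or covers the whole file:

```python
import gzip, os

def last_line_backward(path, start=1024):
    # Double the backward window until it contains a newline before the
    # final line, or until it covers the whole (uncompressed) file.
    # Caveat: every backward seek makes GzipFile re-decompress from the
    # start, so a too-small initial window is costly on big files.
    with gzip.open(path, 'rb') as g:
        end = g.seek(0, os.SEEK_END)   # uncompressed size (one full pass)
        window = start
        while True:
            g.seek(max(end - window, 0))
            data = g.read()
            if b'\n' in data[:-1] or window >= end:
                return data.splitlines()[-1].decode()
            window *= 2
```

Note that `splitlines()` strips the trailing newline, so the line comes back without `\n`, and an empty file would raise `IndexError`.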

Basj
  • The problem is that you can't interpret the end of the compressed data without knowing what comes before it. That is the tradeoff you make with compression: you sacrifice access time in return for a saving of space. – BoarGules Sep 09 '21 at 11:10
  • @BoarGules Reading a single line from start is very fast (with `for line in g: break`): it reads bytes until `\n` is reached (more or less). So technically there should be a way to do the same backwards: read from the end, bytes after bytes, and stop when `\n` is there. Reading from the end should be as fast as reading from start, technically. – Basj Sep 09 '21 at 11:13
  • @DarkKnight if it's not gzipped, no, I don't think so: we can just move the cursor to the EOF, and read one byte in loop in reverse order (file seek -1 byte from current position), and stop when we encounter `\n`. This should be the same speed as reading the first line. – Basj Sep 10 '21 at 07:20
  • @DarkKnight I just did 1 minute ago, I confirm that, if NOT gzipped, this method is very fast: 1 millisecond for a 130 MB file. I just updated the question to add this code for non-gzipped situation. – Basj Sep 10 '21 at 07:27
  • @Basj the case is trivial for raw data, not so for compressed data, because DEFLATE means all the data of a block depends on the block declaration and the data preceding it (in the block). And DEFLATE streams are *bit*streams where blocks have a very simple 3-*bit* header, so a deflate stream is not [self-synchronising](https://en.wikipedia.org/wiki/Self-synchronizing_code): from a random point in the stream you've got no way to discover where the current block started, or where the next one starts. – Masklinn Sep 10 '21 at 07:34