
I have a text file which is >300GB in size originally, and gzipped it still has >10GB. (it is a database export which ran for days, and then was aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)

I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). This file does not grow any more, so I don't need to track changes or appended data, a.k.a. tail -f.

Is there a way to gunzip only the last part of the file?

tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - | less

does not work (it returns stdin: not in gzip format). Since gzip can compress not just files but also streams of data, it should be possible to find an entry point somewhere in the file from which to start uncompressing, without having to read the file header. Right?
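For reference, the failure is easy to reproduce with a small stand-in file (the file below is made up for illustration; it is not the real dump):

```shell
# The gzip magic bytes (1f 8b) occur only at the very start of the stream,
# so a slice taken from the end contains no header for gunzip to find.
seq 1 100000 | gzip > /tmp/sample.gz             # small stand-in for the real dump
od -An -tx1 -N2 /tmp/sample.gz                   # first two bytes: 1f 8b
tail -c 10000 /tmp/sample.gz | gunzip - || true  # gzip: stdin: not in gzip format
```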

Jens

  • What about this? https://stackoverflow.com/questions/22533060/how-to-zgrep-the-last-line-of-a-gz-file-without-tail – Dominique Jun 09 '21 at 07:11
  • https://unix.stackexchange.com/questions/429197/reading-partially-downloaded-gzip-with-an-offset `Since gzip can compress not just files, but also streams of data, it should be possible to search` It does not matter whether it's a file or a stream; it has to have a gzip header in front. A stream also begins with a gzip header containing the magic number. – KamilCuk Jun 09 '21 at 08:01

2 Answers


No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
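So in practice the whole stream has to be fed through the decoder, keeping only the tail; a minimal sketch with a small stand-in file:

```shell
# Decode the entire stream, but keep only the last few lines;
# nothing except the tail is held in memory or written out.
seq 1 100000 | gzip > /tmp/export.gz   # stand-in for the multi-GB dump
gunzip -c /tmp/export.gz | tail -n 3   # prints 99998, 99999, 100000
```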

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • That's very unfortunate, but if gzip is designed that way, well. Do you have a suggestion what other large-file / stream compressor I could prefer next time, preferably with low CPU load because the whole thing runs on a Raspberry Pi? – Jens Sep 09 '21 at 11:00
  • If you are in control of the compression side of things (it sounded like you weren't), you can use pigz with the `--independent` option to create independently decompressible blocks in the gzip stream. Or you could use bzip2, which naturally does that, but it takes a fair bit more CPU on both compression and decompression. – Mark Adler Sep 09 '21 at 16:37
  • I was, but used gzip intentionally because bzip2 was multiple times slower and loaded the machine far too much. But in the future, if this happens again, I'll try pigz. Thanks! – Jens Sep 10 '21 at 21:40
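For later readers, a sketch of the pigz variant suggested above (assumes pigz is installed; the result is still a plain gzip stream):

```shell
# --independent (-i) resets the compressor state at block boundaries,
# so each block can in principle be decoded without the data before it.
# Finding a boundary to start from still needs extra tooling, though.
seq 1 100000 | pigz --independent > /tmp/export-indep.gz
gunzip -c /tmp/export-indep.gz | tail -n 3   # normal tools still read it fine
```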

Quick follow-up on my own question: this is not possible with gzip without hackery (there are patches for gzip that compress in chunks, so each chunk can be decoded independently).

BUT you can use xz: at its lowest compression preset (-0), the CPU load is comparable to gzip, and so is the compression ratio. And xz can actually decompress parts of a compressed file.

I will consider this for the future.
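A sketch of what that looks like (assumes xz is installed; --block-size makes xz emit independently compressed blocks plus an index at the end of the file; that index is what lets a seeking decoder jump straight to the last block, while plain xz -dc still reads from the front):

```shell
# -0 is the fastest preset; --block-size splits the output into
# independently compressed blocks recorded in the file's index.
seq 1 100000 | xz -0 --block-size=65536 > /tmp/export.xz
xz --list /tmp/export.xz            # summary of streams and blocks
xz -dc /tmp/export.xz | tail -n 3   # ordinary decode still works
```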

Jens