8

In a database backup process I generate a text dump file. As the database is quite large, the dump file is huge too, so I compress it with gzip. Compression is done inline while the dump is generated (thanks, Unix pipes!).
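For reference, the backup step looks roughly like this (a simplified sketch; the mysqldump options and database name here are placeholders, not my exact command):

# sketch of the inline-compression pipeline; options are illustrative
mysqldump --single-transaction mydb | gzip > ${PATHSAVE}/dumpFull.sql.gz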

At the end of the process, I check the dump file's validity by looking at the last line and checking for the presence of the "Dump completed" string. In my script I do it by extracting the last line into a variable:

str=`zcat ${PATHSAVE}/dumpFull.sql.gz | tail -n1`

As the database dump file is huge (currently more than 200 GB), this final check takes a very long time to run (currently more than 180 minutes).

I'm searching for a quicker way to extract the last line of my .gz file... any ideas?

Note 1: For context, the database is MySQL Community, the backup tool is mysqldump, the generated dump file is plain text, the OS is CentOS, and the backup script is a Bash shell script.

Note 2: I'm aware of Percona XtraBackup, but in this case I want to use mysqldump for this specific backup job. Restoration time is not an issue.

tdaget
  • Can you change the way the gzip file is created? There's a subset of the gzip format that's referred to as "rsyncable" gzip that resets the compression table every so often; for those, it's feasible to start reading partway through (discarding only information up to the next reset point). There's a penalty in compression ratio, of course, but that can be tuned by changing the frequency of such restarts. – Charles Duffy Jun 29 '18 at 15:03
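A minimal sketch of that suggestion, assuming your gzip build supports the --rsyncable flag (GNU gzip 1.7+ and Debian-patched builds do; the stock CentOS gzip may not):

# --rsyncable makes gzip reset its compressor periodically, at a small cost in ratio
mysqldump mydb | gzip --rsyncable > ${PATHSAVE}/dumpFull.sql.gz

You would still need extra tooling (e.g. a zran-style index, as in zlib's examples) to locate a reset point near the end of the file and start decompressing there.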

3 Answers

4

This is a job for a fifo (a pipe) and the tee command. Use this when making your backup.

mkfifo mypipe
tail -n 1 mypipe > lastline.txt & mysqldump whatever | tee mypipe | gzip > dump.gz
rm mypipe

What's going on?

mkfifo mypipe puts a fifo object into your current working directory. It looks like a file that you can write to, and read from, at the same time.

tail -n 1 mypipe > lastline.txt uses tail to read whatever you write to mypipe and save the last line in a file.

mysqldump whatever | tee mypipe | gzip >dump.gz does your dump operation, and pipes the output to the tee command. Tee writes the output to mypipe and pipes it along to gzip.

The & between the two parts of the command causes both parts to run at the same time.

rm mypipe gets rid of the fifo object.
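Putting the pieces together, here is a hedged sketch of the whole job with cleanup and the question's validity check added (assuming GNU coreutils; the file and fifo names are illustrative):

#!/bin/bash
pipe=$(mktemp -u)                  # pick a unique name for the fifo
mkfifo "$pipe"
trap 'rm -f "$pipe"' EXIT          # remove the fifo even if the dump fails

tail -n 1 "$pipe" > lastline.txt &            # start the reader first
mysqldump mydb | tee "$pipe" | gzip > dump.gz
wait                                          # let tail finish writing lastline.txt

grep -q "Dump completed" lastline.txt && echo "backup OK"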

Charles Duffy pointed out that some shells (including bash) have process substitution, and so your command can be simpler if you're using one of those shells.

 mysqldump whatever | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz

In this case the shell creates its own pipe for you.
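The validity check the question asks for might then look like this (a sketch; note that the >( ... ) process runs asynchronously):

mysqldump mydb | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz
wait $!    # bash 4.4+: wait for the >( ... ) process to flush lastline.txt
grep -q "Dump completed" lastline.txt && echo "backup OK"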

Credit: Pipe output to two different commands

O. Jones
  • This is bad: it fails to "check dump file validity" which is the whole point to start with. – kmkaplan Jun 30 '18 at 06:06
  • It's exactly what I was searching for: a way to work on the last line of the file without doing a full read of it... it's as simple as doing it while compressing. Thanks – tdaget Jul 10 '18 at 22:04
1

You can easily determine the last line in the pipeline itself, rather than reading it back from disk, by running your existing verification (zcat and tail) on stdin instead of on the written file:

mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n1

This gives you the last processed line, uncompressed, which you can check for the "Dump completed" string to confirm success.
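In script form that might look like this (a sketch; the variable name is illustrative):

last=$(mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n 1)
if [[ $last == *"Dump completed"* ]]; then
    echo "backup OK"
else
    echo "backup FAILED" >&2
fi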

This is pretty simple, but it is not quite 100% bullet-proof: there is a small chance that tee fails to finish writing to disk but completes writing to stdout:

If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html

To protect against this possible failure we can simply embellish this pipeline a little:

mysqldump args | gzip - | { tee fullDump.gz || echo "fail" | gzip -; } | zcat - | tail -n1

Such that we emit fail as the last line in the event that tee exits non-zero. If you are using bash as your shell, you could alternatively use pipefail (set -o pipefail), which causes the pipeline to exit non-zero if any command fails (rather than just the last one).
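A sketch of the pipefail variant (bash-specific):

set -o pipefail
if mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n1 | grep -q "Dump completed"; then
    echo "backup OK"
else
    # a pipeline stage exited non-zero, or the marker line is missing
    echo "backup FAILED" >&2
fi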

Matthew Story
  • This is bad: it fails to "check dump file validity" which is the whole point to start with. – kmkaplan Jun 30 '18 at 07:36
  • fair, fixed. Went off the most popular answer rather than the question @kmkaplan – Matthew Story Jun 30 '18 at 07:52
  • Now I'm puzzled. The tee command could write to file before stdout... Or the opposite. Is it specified? – kmkaplan Jun 30 '18 at 07:58
  • @kmkaplan we have a bit more work to do: If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero. ~ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html – Matthew Story Jun 30 '18 at 08:11
  • @kmkaplan this edge-case is now covered – Matthew Story Jun 30 '18 at 08:19
0

If you really do want to check the compressed file itself, you could switch from gzip to a compression format that can be decompressed starting from the middle of the file, and do something like:

# skip ~195 GB into the compressed file and decompress only the tail;
# strings cleans up partial output if decompression starts mid-stream
dd if=dump.compressed.not-gz bs=1M skip=195000 | compress-tool -df | strings | grep 'Dump complete'
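One concrete possibility for such a tool (my suggestion, not something this answer names) is htslib's bgzip: it compresses in independent blocks yet its output is still valid gzip, so zcat keeps working for restores, and with suitable tooling the end of the file can be decompressed without reading everything before it:

# assumes htslib's bgzip is installed; output remains zcat-compatible
mysqldump args | bgzip -c > dump.sql.gz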

Edit: You could also take a test dump into which you've injected the 'Dump completed' string, compress it, and inspect the gzip output to discover the string's compressed signature. Database dumps are similar, so it may well be the same in all dumps; if so, just grep for it as above, but without the decompression and strings steps.

Jonas Bjork
  • You could look for the line during the dump process. Anyway, you'd better use more sophisticated ways to verify backups. – Jun 29 '18 at 19:20