
I have a recurring task of splitting a set of large (about 1-2 GiB each) gzipped Apache logfiles into several parts (say chunks of 500K lines). The final files should be gzipped again to limit the disk usage.

On Linux I would normally do:

zcat biglogfile.gz | split -l500000

The resulting files will be named xaa, xab, xac, etc. So I do:

gzip x*

This method temporarily stores the huge uncompressed files on disk as an intermediate result. Is there a way to avoid this intermediate disk usage?

Can I (in a way similar to what xargs does) have split pipe the output through a command (like gzip) and recompress the output on the fly? Or am I looking in the wrong direction and is there a much better way to do this?

Thanks.

Niels Basjes
  • I would look at implementing split-style functionality in a scripting language, where you could write the lines straight into gzipped files. – a'r Oct 18 '10 at 15:39

4 Answers


You can use split's --filter option, as explained in the manual, e.g.:

zcat biglogfile.gz | split -l500000 --filter='gzip > $FILE.gz'

Edit: I'm not sure when the --filter option was introduced, but according to the comments it is not available in coreutils 8.4.
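For reference, split sets the FILE environment variable to each chunk's output name before running the filter, so the command above yields xaa.gz, xab.gz, and so on. A minimal variation of the same idea, assuming a coreutils recent enough to support both --filter and -d (numeric suffixes), that also reads stdin explicitly and sets a filename prefix:

zcat biglogfile.gz | split -l500000 -d --filter='gzip > $FILE.gz' - biglogfile.part.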

jimkont
  • Thanks. I think using a feature in split that was designed to do this kind of operation is always better than homegrown code. – Niels Basjes Aug 10 '14 at 18:06
  • Very nice, but note that split in coreutils 8.4 does not have a `--filter` argument. – zach Mar 02 '15 at 22:19

A script like the following might suffice.

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my $filename = 'out';
my $limit    = 500000;

my $fileno = 1;
my $line   = 0;
my $fh;

while (<>) {
    # Open the next gzipped output file on the first line and
    # whenever the current file reaches the line limit.
    if (!$fh || $line >= $limit) {
        open $fh, '>:gzip', "${filename}_$fileno"
            or die "Cannot open ${filename}_$fileno: $!";
        $fileno++;
        $line = 0;
    }
    print $fh $_;
    $line++;
}
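The script reads from standard input, so (assuming it is saved under a name such as split-logs.pl, chosen here just for illustration) you would feed it the decompressed log directly:

zcat biglogfile.gz | perl split-logs.pl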
a'r
  • Thanks, your quick example helped me a lot. With two minor fixes (the first line must start with #!/ and after the $fileno++ an additional $line=0 is needed) it worked well enough for my purposes. – Niels Basjes Oct 20 '10 at 08:56
  • Thanks. I've added those to the script for completeness. – a'r Oct 20 '10 at 09:59

In case people need to keep the first row (the header) in each of the pieces:

zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ { zcat bigfile.csv.gz | head -n 1 | gzip; gzip; } > $FILE.gz; };'

I know it's a bit clunky. I'm looking for a more elegant solution.
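A slightly cleaner sketch along the same lines, assuming a POSIX shell and a header small enough to hold in an environment variable: read the header once and export it, so each chunk's filter can prepend it without re-decompressing the whole file every time:

header=$(zcat bigfile.csv.gz | head -n 1)
export header
zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ printf "%s\n" "$header"; cat; } | gzip > $FILE.gz'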

Zach

There's zipsplit, but that produces zip archives rather than gzip files.
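For completeness, a sketch of how it might be invoked; note that Info-ZIP's zipsplit operates on an existing .zip archive (so the log would first have to be repackaged as one) and its -n option caps each piece's size in bytes:

zipsplit -n 536870912 biglogfile.zip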

Tony Miller