
I have a recurring task of splitting a set of large (about 1-2 GiB each) gzipped Apache logfiles into several parts (say chunks of 500K lines). The final files should be gzipped again to limit the disk usage.

On Linux I would normally do:

zcat biglogfile.gz | split -l500000

The resulting files will be named xaa, xab, xac, etc. So I do:

gzip x*

This method temporarily stores the huge uncompressed files on disk as an intermediate result. Is there a way to avoid this intermediate disk usage?

Can I (in a way similar to what xargs does) have split pipe the output through a command (like gzip) and recompress the output on the fly? Or am I looking in the wrong direction and is there a much better way to do this?

Thanks.

Niels Basjes
  • I would look at implementing split-style functionality in a scripting language, where you could write the lines straight into gzipped files. – a'r Oct 18 '10 at 15:39

4 Answers


You can use split's --filter option, as explained in the manual, e.g.:

zcat biglogfile.gz | split -l500000 --filter='gzip > $FILE.gz'

Edit: I'm not sure when the --filter option was introduced, but according to the comments it is not available in coreutils 8.4.
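For reference, split sets the FILE environment variable to each chunk's output name before running the filter, so the command above yields xaa.gz, xab.gz, and so on. A minimal variation of the same idea, assuming a coreutils recent enough to support both --filter and -d (numeric suffixes), that also reads stdin explicitly and sets a filename prefix:

zcat biglogfile.gz | split -l500000 -d --filter='gzip > $FILE.gz' - biglogfile.part.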

jimkont
  • Thanks. I think using a feature in split that was designed to do this kind of operation is always better than homegrown code. – Niels Basjes Aug 10 '14 at 18:06
  • Very nice, but note that split in coreutils 8.4 does not have a `--filter` argument. – zach Mar 02 '15 at 22:19

A script like the following might suffice.

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my $filename = 'out';
my $limit    = 500000;

my $fileno = 1;
my $line   = 0;
my $fh;

while (<>) {
    # Open the next gzipped output file on the first line and
    # whenever the current file reaches the line limit.
    if (!$fh || $line >= $limit) {
        open $fh, '>:gzip', "${filename}_$fileno"
            or die "Cannot open ${filename}_$fileno: $!";
        $fileno++;
        $line = 0;
    }
    print $fh $_;
    $line++;
}
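The script reads from standard input, so (assuming it is saved under a name such as split-logs.pl, chosen here just for illustration) you would feed it the decompressed log directly:

zcat biglogfile.gz | perl split-logs.pl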
a'r
  • Thanks, your quick example helped me a lot. With two minor fixes (the first line must start with #!/ and after the $fileno++ an additional $line=0 is needed) it worked well enough for my purposes. – Niels Basjes Oct 20 '10 at 08:56
  • Thanks. I've added those to the script for completeness. – a'r Oct 20 '10 at 09:59

In case people need to keep the first row (the header) in each of the pieces:

zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ { zcat bigfile.csv.gz | head -n 1 | gzip; gzip; } > $FILE.gz; };'

I know it's a bit clunky. I'm looking for a more elegant solution.
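A slightly cleaner sketch along the same lines, assuming a POSIX shell and a header small enough to hold in an environment variable: read the header once and export it, so each chunk's filter can prepend it without re-decompressing the whole file every time:

header=$(zcat bigfile.csv.gz | head -n 1)
export header
zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ printf "%s\n" "$header"; cat; } | gzip > $FILE.gz'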

Zach

There's zipsplit, but that produces zip archives rather than gzip files.
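For completeness, a sketch of how it might be invoked; note that Info-ZIP's zipsplit operates on an existing .zip archive (so the log would first have to be repackaged as one) and its -n option caps each piece's size in bytes:

zipsplit -n 536870912 biglogfile.zip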

Tony Miller