1 The gzip stream
Since gzip compresses with history tables (later data refers back to earlier data), you cannot simply split the compressed stream at arbitrary points. You need to decode it.
You can save some disk space by piping the decompressed output directly into a splitter program instead of writing the raw data to disk first.
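A minimal illustration of why decoding is unavoidable (file names are placeholders, GNU split assumed): pieces cut out of the compressed file are not usable gzip streams.

split -b 1G source.csv.gz piece_
gzip -t piece_aa   # fails: the gzip stream is truncated
gzip -t piece_ab   # fails: no gzip header, and deflate needs the history of the preceding data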
2 The chunk size
The next task is to produce chunks of at most 4 GB. If you cannot safely estimate the compression ratio, the only safe approach is to cut after every 4 GB of raw CSV data. This will typically produce compressed chunks that are much smaller than 4 GB.
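If you want a rough estimate of the ratio anyway, you can measure it in one streaming pass (a sketch with placeholder file names; gzip -l is not reliable for this because it stores the uncompressed size only modulo 4 GiB):

raw=$(gzip -dc source.csv.gz | wc -c)   # raw CSV bytes
gz=$(wc -c < source.csv.gz)             # compressed bytes
echo "compression ratio: $((raw / gz)) : 1"

With a typical text ratio of, say, 5:1 to 10:1, cutting after every 4 GB of raw data would give compressed chunks of roughly 400 to 800 MB.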
3 The CSV stream
Since the CSV format (with embedded newlines in quoted fields) does not allow resynchronization at an arbitrary point, the entire CSV stream has to be parsed as well - assuming that you do not simply want to split the CSV at arbitrary locations but at the end of a record, i.e. a logical CSV line.
In fact the parsing can be simplified considerably. Assuming that double quotes inside CSV fields are escaped as usual by doubling them, every newline character that appears after an even number of double quotes is a record separator. So the parser's task reduces to counting the double quotes in the stream.
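For example, in this made-up fragment

1,"note with a
line break",ok
2,"she said ""hi""",ok

the newline after with a is preceded by one double quote (odd), so it is still inside a field; the newline after the first ok is preceded by two quotes (even) and ends record 1; the newline after the second ok is preceded by eight quotes in total and ends record 2.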
I am pretty sure that this still can't reasonably be solved with a shell script only. I would recommend some Perl script or something like that to do the parsing and splitting.
The script needs to read from stdin, count the number of bytes and the number of double quotes, and pass the data on to a gzip >targetfile pipe. Each time the byte count is about to reach the limit from task 2, it looks for a newline character in the current buffer that follows an even number of double quotes in the stream. The bytes up to this point are sent to the current gzip instance and that output stream is closed. Then the target file name is incremented, a new gzip output is opened, the byte and quote counters are reset, and the remaining part of the current buffer is passed to the new gzip output stream.
The following script demonstrates the solution:
#!/usr/bin/perl
use strict;
use warnings;
my $targetfile = "target";
my $limit = 1 << 32; # 4GB of raw CSV data per chunk
my $filenum = 0;
open F, "|-", "gzip >$targetfile-$filenum.gz" or die;
my $buffer;
my $bytes = 0;  # raw bytes in the current chunk so far
my $quotes = 0; # double quotes seen since the start of the current chunk
while (read STDIN, $buffer, 1024*1024)
{ $bytes += length $buffer;
  if ($bytes > $limit)
  { # find the first newline preceded by an even number of double quotes,
    # i.e. a real record separator
    my $pos = 0;
    do
    { $pos = 1 + index $buffer, "\n", $pos;
      $pos or die "no valid delimiter found: $bytes";
    } while (((substr($buffer, 0, $pos) =~ tr/"//) + $quotes) & 1);
    # flush everything up to the separator and finish the current chunk
    print F substr $buffer, 0, $pos or die;
    close F;
    # start the next chunk with the rest of the buffer
    ++$filenum;
    open F, "|-", "gzip >$targetfile-$filenum.gz" or die;
    $buffer = substr $buffer, $pos;
    $bytes = length $buffer;
    $quotes = 0; # the new chunk starts exactly at a record boundary
  }
  $quotes += $buffer =~ tr/"//;
  print F $buffer or die;
}
close F;
The script assumes that every 1 MB block read contains at least one valid record separator.
4 Invoke the whole pipeline
gzip -d -c sourcefile | perlscript
This will do the entire task. It will not use significantly more than a few MB of memory, mainly for the Perl interpreter.
On disk you of course need roughly twice the storage, to hold both the source file and the target files.
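As a quick sanity check (a sketch, using the target-*.gz names produced by the script above), every chunk can be decompressed on the fly to verify its integrity and raw size:

for f in target-*.gz; do
  printf '%s: ' "$f"
  gzip -dc "$f" | wc -c   # raw CSV bytes in this chunk
done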