
I have a few very large, gzip-compressed CSV files (the compressed output of mysqldump), each around 65 GB.

I need to split them up into compressed chunks that are each less than 4 GB (after compression), keeping in mind that quoted newline characters exist in each CSV file.

What would be the most efficient way to do this on a 'nix command line (e.g. in Debian)?

Similar to this SO question, although the responses there don't correctly account for quoted newline characters.

saladi

2 Answers


A method without using temporary disk space.

Chunk Size Estimation

First, since the compression ratio varies from row to row depending on its content, it is hard to hit a precise after-compression size unless you run a two-pass encoding. However, you can get a rough estimate of how large a chunk should be before compression by using

gzip -l file.csv.gz | tail -1 | awk '{print int(4*$2/$1)}'

The 4 (GB) in this command is your target size for each chunk after compression, so if the result is 21, it means you can split the uncompressed data into chunks of roughly 21 GB each, assuming the file has a fairly uniform entropy distribution.
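
If you prefer, the estimate can be captured in a shell variable so it can be fed straight into the split step below (a small convenience sketch, not part of the original answer; chunk_gb is just an arbitrary name):

chunk_gb=$(gzip -l file.csv.gz | tail -1 | awk '{print int(4*$2/$1)}')   # estimated uncompressed chunk size in GB
echo "split the uncompressed stream into chunks of about ${chunk_gb}G"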

Decompress, Split and Compress

We can use the chunk size obtained above to split the file

gzip -dc file.csv.gz | split -C 21G -d - split_ --filter='gzip > $FILE.csv.gz'
  • gzip -dc decompresses the file to stdout
  • split -C 21G puts at most 21 GB of records per output file
  • split --filter='gzip > $FILE.csv.gz' compresses each split file on the fly

Bonus Tip: replace all gzip above with pigz for faster multi-threaded compression.
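
For instance, the split command from above with pigz swapped in everywhere would look like this (a direct substitution with the same flags; pigz needs to be installed separately):

pigz -dc file.csv.gz | split -C 21G -d - split_ --filter='pigz > $FILE.csv.gz'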

Update

To preserve the header for each split gzipped file, we can use

gzip -dc file.csv.gz | tail -n+2 | split - --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz'

Some options of split are left out here for the sake of simplicity, but you get the idea. The trick is to modify the filter option of split so that it prepends the header from the original CSV file to the output stream of each split file; a fuller version is sketched below.
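
Putting the size-related options back in, the full header-preserving command might look like this (a sketch that simply combines the two commands above, assuming the same 21 GB estimate and the split_ prefix):

gzip -dc file.csv.gz | tail -n+2 | split -C 21G -d - split_ --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz'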

etopylight
  • liked `pigz` part the most – desaiankitb Feb 27 '18 at 12:21
  • how can we keep header in each file? – desaiankitb Feb 28 '18 at 12:13
  • @desaiankitb Nice question, I updated the answer in case we want to preserve the header for each split file – etopylight Mar 01 '18 at 04:44
  • ```header=`gzip -dc file.tab.gz | head -1`; pigz -dc file.tab.gz | sed 1d | split --additional-suffix=.tab -C 300M -d - split_ ; sed -i "1i ${header}" split_*; pigz split_* ``` We could do it in 4 steps... your answer is of course optimal. Thanks. – desaiankitb Mar 01 '18 at 11:42
  • Note: check https://stackoverflow.com/questions/46035099/mac-bash-split-cannot-split-by-line-bytes if you get an error like `split: illegal option -- C` – Vincent Nov 14 '18 at 17:19
  • You know, I did something like that in the code mentioned [here](https://stackoverflow.com/questions/64124467/breaking-gzipped-json-into-chunks-of-arbitrary-size). I found that I had to allow some headroom (10% was sufficient in my case, but I'm not sure if that's sufficient in the general case) because the size of the gzipped chunk did not scale linearly with the size of the uncompressed chunk. – Aurelia Peters Oct 02 '20 at 14:54

1 The gzip stream

Since gzip compresses with a sliding-window history, you cannot simply split the compressed stream; you need to decode it. You can save some disk space by piping the decompressed output directly into a splitter program.

2 The chunk size

The next task is to get chunks of 4 GB. If you cannot safely estimate the compression ratio, the only safe way is to cut after every 4 GB of raw CSV data. This may result in chunks much smaller than 4 GB after compression.

3 The CSV stream

Since the CSV format (with embedded newlines) does not allow resynchronization, the entire CSV stream has to be parsed as well - assuming that you do not simply want to split the CSV at arbitrary locations but at the end of a record, i.e. a logical CSV line.

In fact the latter procedure can be somewhat simplified. Assuming that literal double quotes in the CSV data are escaped as usual by doubling them, every newline character that appears after an even number of double quotes can be assumed to be a record separator. So the parser task reduces to counting the double quotes in the stream.
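
To see the rule at work on a toy example (purely illustrative; the demo file, its contents and the awk one-liner are not part of the original answer):

printf 'id,note\n1,"first line\nsecond line"\n2,plain\n' > demo.csv
# A newline ends a record only if the total number of double quotes seen so far is even.
awk '{ n += gsub(/"/, "\""); print NR ": quotes so far = " n " -> " (n % 2 ? "inside quoted field" : "record boundary") }' demo.csv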

I am pretty sure that this still can't reasonably be solved with a shell script only. I would recommend some Perl script or something like that to do the parsing and splitting.

The script needs to read from stdin, count the number of bytes and the number of double quotes, and pipe the data to a gzip > targetfile process. Each time the byte count is about to reach the limit from task 2, it should look for a newline character in the current buffer that is preceded by an even number of double quotes in the stream. The bytes up to that point are then sent to the current gzip instance and its output stream is closed. Now increment the target file name, open a new gzip output, reset the byte counter, and pass the remaining part of the current buffer to the new gzip output stream.

The following script demonstrates the solution:

#!/usr/bin/perl
use strict;
use warnings;

my $targetfile = "target";
my $limit = 1 << 32; # 4 GB of raw CSV data per chunk

my $filenum = 0;
open F, "|-", "gzip >$targetfile-$filenum.gz" or die;
my ($buffer, $bytes, $quotes) = ("", 0, 0);
while (read STDIN, $buffer, 1024*1024)
{ $bytes += length $buffer;
  if ($bytes > $limit)
  { # Look for a newline preceded by an even total number of double quotes,
    # i.e. a real record separator rather than a newline inside a quoted field.
    my $pos = 0;
    do
    { $pos = 1 + index $buffer, "\n", $pos;
      $pos or die "no valid record separator found: $bytes";
    } while (((substr($buffer, 0, $pos) =~ tr/"//) + $quotes) & 1);
    # Flush everything up to the record boundary and start the next chunk.
    print F substr $buffer, 0, $pos or die;
    close F;
    ++$filenum;
    open F, "|-", "gzip >$targetfile-$filenum.gz" or die;
    $buffer = substr $buffer, $pos;
    $bytes = length $buffer;
    $quotes = 0; # the quote count is even at the boundary, so reset the parity
  }
  $quotes += $buffer =~ tr/"//;
  print F $buffer or die;
}
close F;

The script assumes that each 1 MB block contains at least one valid record separator.

4 Invoke the whole pipeline

gzip -d -c sourcefile | perlscript

This will do the entire task. It will not use significantly more than a few MB of memory, mainly for the Perl interpreter.

On disk, of course, you need roughly twice as much storage, to hold the source file as well as the target files.
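
If you want to verify the result afterwards, a quick sanity check might look like this (a sketch, not part of the answer, assuming bash, GNU ls and the target- prefix used by the script above):

# Concatenated gzip members decompress as a single stream, so the rebuilt data
# should match the original byte for byte; ls -v keeps target-10 after target-9.
cmp <(gzip -dc sourcefile) <(cat $(ls -v target-*.gz) | gzip -dc) && echo "chunks are consistent"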

Marcel