7

I have a program (gawk) that outputs a stream of data to its STDOUT. The data processed is literally tens of GBs. I don't want to persist it in a single file, but rather split it into chunks and potentially apply some extra processing (like compression) to each one before saving.

My data is a sequence of records, and I don't want the splitting to cut a record in half. Each record matches the following regexp:

^\{index.+?\}\}\n\{.+?\}$

or, for simplicity, one can assume that two rows (first an odd-numbered one, then an even-numbered one, counting from the beginning of the stream) always make up a record.
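For illustration, a single record of that shape could look like the following two lines (the content here is made up, only the overall structure matters):

{index: {id: 1}}
{name: "foo", value: 42}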

Can I:

  • use some standard Linux command to split STDIN by defining a preferred chunk size? It doesn't need to be exact, since the variable record size can't guarantee that anyway. Alternatively, split just by number of records, if defining by size is impossible
  • compress each chunk and store it in a file (with some numbering in its name, like 001, 002, etc.)?

I've become aware of commands like GNU parallel and csplit, but I don't know how to put them together. It would be nice if the functionality explained above could be achieved without writing a custom Perl script for it. That could, however, be another, last-resort solution, but again, I'm not sure how best to implement it.

msciwoj
  • Is there some reason why you can't just use `split -l` with an even number for the `-l` parameter? See: [man split](http://linux.die.net/man/1/split) – Paul R Mar 25 '14 at 08:10
  • @PaulR there is - I would first need to instantiate the whole data stream as a physical file on disk. – msciwoj Mar 26 '14 at 07:03

2 Answers

8

GNU Parallel can split stdin into chunks of records. This will split stdin into 50 MB chunks with each record being 2 lines. Each chunk will be passed to gzip and compressed to the name [chunk number].gz:

cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz

If you know your second line will never start with '{index' you can use '{index' as the record start:

cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz

You can then easily test whether the splitting went correctly:

parallel zcat {} \| wc -l ::: *.gz

Unless your records are all the same length, you will probably see a different number of lines in each chunk, but they will all be even.
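If you want to check that explicitly, here is a minimal sketch (assuming the chunks are named 1.gz, 2.gz, ... as produced above) that flags any chunk with an odd line count, i.e. a record cut in half:

# every chunk should contain an even number of lines (2 lines per record)
for f in *.gz; do
  lines=$(zcat "$f" | wc -l)
  [ $((lines % 2)) -eq 0 ] || echo "$f has an odd line count: $lines"
done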

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

Ole Tange
  • OK, looks very good. One question though: parallel is used here as a workaround, so it's more to split the data stream than to run all those compression jobs concurrently. In my case this is a bit of a concern, as the machine I'm running it on has a low spec. Could this be limited to no more than one or two downstream jobs at a time? – msciwoj Mar 25 '14 at 11:22
  • GNU Parallel does _not_ run all jobs in parallel. It defaults to 1 job per CPU core. Watch the intro videos and walk through the tutorial to learn how to change the default. – Ole Tange Mar 25 '14 at 20:36
  • I'm getting this message back: parallel: Warning: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu. parallel: Warning: --tollef is obsolete and will be retired 20140222. parallel: Warning: See: http://lists.gnu.org/archive/html/parallel/2013-02/msg00018.html /bin/bash: -c: option requires an argument – msciwoj Mar 26 '14 at 06:59
  • Are you hit by http://stackoverflow.com/questions/16448887/gnu-parallel-not-working-at-all If so: Complain to your distribution maintainer. – Ole Tange Mar 26 '14 at 07:28
  • OK, added --gnu and now getting this message `parallel: Warning: No more processes: Decreasing number of running jobs to 1. Raising ulimit -u may help.` I call parallel like: `parallel --gnu -l2 --pipe -j 1 --block 30m gzip ">"{#}.gz`; it produces a few files but stops after a while. The command still runs but continuously prints the same warning. I've experimented, and the bigger the block I set, the sooner it stops... any ideas? – msciwoj Mar 27 '14 at 12:27
  • That sounds very weird. See if you can report a bug (follow REPORTING BUGS in 'man parallel'). – Ole Tange Mar 27 '14 at 15:06
3

You can also use the split utility (shipped with the GNU coreutils package, unlike parallel, so it has a better chance of being present on the target system). It can read STDIN (in addition to ordinary files), use by-line or by-size thresholds, and apply custom logic to chunks via the --filter CMD option. Please refer to the corresponding man page for usage details.

cat target | split -d -l10000 --suffix-length 5 --filter 'gzip > $FILE.gz' - prefix.

This is going to split STDIN into gzipped chunks of 10000 lines each, named prefix.<CHUNK_NUMBER>.gz, where <CHUNK_NUMBER> starts from 0 and is padded with zeros to a length of 5 (e.g. 00000, 00001, 00002, etc.). The start number and an extra suffix can be set too, as shown below.
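For example, a sketch of that (assuming a coreutils version recent enough to support --numeric-suffixes=FROM and --additional-suffix; the .part suffix is just an arbitrary choice), which starts numbering at 1 and produces names like prefix.00001.part.gz:

cat target | split --numeric-suffixes=1 -l10000 --suffix-length 5 --additional-suffix=.part --filter 'gzip > $FILE.gz' - prefix.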

DimG