I have a program (gawk) that outputs a stream of data to its STDOUT. The data processed is literally tens of GBs. I don't want to persist it in a single file, but rather split it into chunks and potentially apply some extra processing (like compression) to each chunk before saving.
My data is a sequence of records, and I don't want the splitting to cut a record in half. Each record matches the following regexp:
^\{index.+?\}\}\n\{.+?\}$
Or, for simplicity, you can assume that two rows (first odd, then even, when numbering from the beginning of the stream) always make a record.
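For illustration, a single record looks roughly like this (the content here is made up; it just has to match the regexp above):

```
{index: {id: 1}}
{name: "foo", value: 42}
```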
Can I:
- use some standard Linux command to split STDIN by specifying a preferred chunk size? It doesn't need to be exact, since the variable record size can't guarantee that anyway. Alternatively, just a number of records per chunk, if splitting by size is impossible;
- compress each chunk and store it in a file (with some numbering in its name, like 001, 002, etc.)? A rough sketch of what I imagine is shown after this list.
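To make it concrete: since two lines always make a record, something like GNU split's `--filter` option seems close to what I want (assuming I understand `-l`, `-d`, `-a` and `--filter` correctly), but I don't know if this is the right approach. The `myscript.awk` and `input.txt` names are just placeholders for my actual pipeline:

```
# gawk writes to stdout; split reads it from stdin ("-") and cuts it
# every 200000 lines (an even number, so no two-line record is cut in half).
# --filter pipes each chunk through gzip; $FILE is the generated output name
# (chunk_000, chunk_001, ... thanks to -d and -a 3).
gawk -f myscript.awk input.txt \
  | split -l 200000 -d -a 3 --filter='gzip > $FILE.gz' - chunk_
```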
I've become aware of commands like GNU parallel and csplit, but I don't know how to put them together.
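For example, with GNU parallel the closest I can guess at is something in this direction, using `--pipe` with `--recstart`/`--recend` so a block boundary never falls inside a record (if I read the man page right), but I don't know whether this is how it's meant to be combined:

```
# --pipe splits stdin into ~100 MB blocks; --recend '\n' --recstart '{index'
# means a block can only be cut at a newline followed by a line starting
# with "{index". Each block is gzipped; {#} is the sequential job number.
gawk -f myscript.awk input.txt \
  | parallel --pipe --block 100M --recend '\n' --recstart '{index' \
      'gzip > chunk_{#}.gz'
```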
It would be nice if the functionality explained above could be achieved without writing a custom Perl script for it. That could be another, last-resort solution, but again, I'm not sure how best to implement it.