You can use GNU AWK (gawk). It has a GNU extension that treats the record separator RS as a regular expression, which we can use to divide the input at "! Chunk Start". Each line of your "chunks" can then be processed as a field. Some traditional AWK implementations limit the number of fields (historically to around 99), but gawk supports up to LONG_MAX fields, which takes care of your worry about 100+ input lines per chunk.
$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt
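For illustration, suppose infile.txt contains this hypothetical input (a preamble line followed by two chunks; the contents are made up for the example):

preamble to ignore
! Chunk Start
a b
c
! Chunk Start
d
e f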
AWK (and GNU AWK) works by dividing input into records, then dividing each record into fields. Here, we divide records (record separator RS) at the string "! Chunk Start" and each record into fields (field separator FS) at newlines (\n). You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case the defaults (ORS="\n" and OFS=" ") happen to be what we want.
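To see ORS and OFS in action, here is a minimal sketch; the separators "-" and ";" are arbitrary choices for illustration, not part of the solution:

$ printf 'x y\n' | gawk 'BEGIN{OFS="-";ORS=";\n"}{$1=$1;print}'
x-y;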
When dividing into records, the part before the first "! Chunk Start" will be considered a record. We ignore this using NR>1. I have interpreted your problem specification

everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk

to mean that once "! Chunk Start" has been seen, everything else until the end of input belongs to some chunk.
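To see why NR>1 is needed, print each record's number and first field; on the hypothetical infile.txt above, record 1 is the preamble:

$ gawk 'BEGIN{RS="! Chunk Start\n"}{printf "record %d starts with: %s\n", NR, $1}' infile.txt
record 1 starts with: preamble
record 2 starts with: a
record 3 starts with: d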
The mysterious $1=$1 forces gawk to rebuild the record $0: the record has already been split into fields at the newlines (FS), and assigning to any field makes gawk reassemble $0 from those fields joined by OFS, which consumes the newlines. The print then outputs this rebuilt record using the output format (OFS and ORS).
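A minimal illustration of the rebuild, independent of the chunk input: the second print shows the run of spaces collapsed to a single OFS space.

$ echo 'a   b' | gawk '{print; $1=$1; print}'
a   b
a b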
Edit: The version above prints a space at the end of each line: since every record ends with a newline, splitting on FS="\n" produces an empty trailing field, which becomes a trailing OFS space when the record is rebuilt. Thanks to @EdMorton for pointing out that the default field separator FS already separates on whitespace (including newlines) and ignores leading/trailing whitespace, so FS should be left unmodified:
$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
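On the hypothetical infile.txt above, this produces one space-joined line per chunk, with no trailing spaces:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
a b c
d e f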