
I have a file with the following contents:

Hi
welcome
! Chunk Start
Line 1
Line2
! Chunk Start
Line 1
Line 2
Line 3
! Chunk Start
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 1
Line 2
Line 3
Line 4

Now, everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk, i.e. the lines between two "! Chunk Start" markers make up a chunk. I need to get the contents of each chunk on a single line, i.e.:

Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

I have done this, but I think there should be a better way. The way I have done this is:

grep -A100 "! Chunk Start" file.txt

The rest of the logic is there to concatenate the lines, but the -A100 is what I am worried about: if a chunk has more than 100 lines, this will fail. I probably need to do this with awk/sed. Please suggest.

Tarun

3 Answers


You can use GNU AWK (gawk). It has a GNU extension that allows the record separator RS to be a regular expression, which we can use to divide the input at each ! Chunk Start. Each line of your "chunks" can then be processed as a field. Some old AWK implementations have a limit on the number of fields, but gawk supports up to MAX_LONG fields. This large number of fields should take care of your worry about chunks with 100+ input lines.

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt

AWK (and GNU AWK) works by dividing the input into records, then dividing each record into fields. Here, we divide records (record separator RS) at the string ! Chunk Start and then divide each record into fields (field separator FS) at each newline \n. You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case what we want happens to be the defaults (ORS="\n" and OFS=" ").
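
For example, spelling those defaults out explicitly changes nothing; it just makes the output format visible:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n";OFS=" ";ORS="\n"}NR>1{$1=$1;print}' infile.txt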

When dividing into records, the part before the first ! Chunk Start will be considered a record. We ignore this using NR>1. I have interpreted your problem specification

everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk

to mean that once ! Chunk Start has been seen, everything else until the end of input belongs in at least some chunk.
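
As a quick check (a throwaway diagnostic, not part of the solution), you can print only that ignored first record to see what it contains:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR==1{printf "%s", $0}' infile.txt
Hi
welcome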

The mysterious $1=$1 forces gawk to rebuild the record $0: assigning to any field makes gawk rejoin all the fields using the output field separator OFS, which replaces the embedded newlines with spaces. The print then outputs this rebuilt record followed by the output record separator ORS.
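
To see why the assignment matters, drop it and compare: without $1=$1 the record is never rebuilt, so each chunk is printed with its newlines still in place rather than joined onto one line:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{print}' infile.txt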

Edit: The version above leaves a trailing space at the end of each line. Thanks to @EdMorton for pointing out that the default field separator FS already splits on whitespace (including newlines), so FS should be left unmodified:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
e0k
  • Thanks for the answer and a very good explanation. The solution is working well. – Tarun Jan 31 '16 at 20:35
  • This version leaves spaces at the end of the lines. (I was hoping either it wouldn't make a difference or no one would notice.) It's an easy thing to fix, but the `sed` answer by @potong filters this out already and is probably the more elegant and efficient solution. – e0k Jan 31 '16 at 20:42
  • You were definitely on the right track. If you hadn't changed `FS` from its default value (which matches all chains of whitespace, including newlines), you wouldn't have created those end-of-line spaces. I think nawk, or maybe just old, broken awk, has a limit on the number of fields, but that's not a concern for standard (POSIX) awks. – Ed Morton Jan 31 '16 at 21:22

This might work for you (GNU sed):

sed '0,/^! Chunk Start/d;:a;$!N;/! Chunk Start/!s/\n/ /;ta;P;d' file

Delete up to and including the first line containing ! Chunk Start. Then gather up lines, replacing each newline with a space. When the next ! Chunk Start (or the end of input) is reached, print the first line of the pattern space, delete the pattern space, and repeat.
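
Spelled out with one command per line and comments (GNU sed allows comments in a script; the behaviour is identical to the one-liner above):

sed '
# delete everything up to and including the first "! Chunk Start" line
0,/^! Chunk Start/d
:a
# unless we are on the last line, append the next input line to the pattern space
$!N
# if no new "! Chunk Start" has been appended, join the lines with a space...
/! Chunk Start/!s/\n/ /
# ...and loop back to :a to keep gathering the current chunk
ta
# otherwise print up to the first newline, i.e. the finished chunk
P
# then delete the pattern space and start the next cycle
d
' file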

potong
  • This is working fine for me. Thanks a lot. Can you please explain the command? – Tarun Jan 31 '16 at 20:29
  • This may be cleaner than my `gawk` version. There also seems to be [no limit on the line length](https://www.gnu.org/software/sed/manual/html_node/Limitations.html) in GNU `sed`. I'm not sure about the portability of this particular `sed` command, but `gawk` is usually something I have to install manually as it doesn't come with all distributions. On the other hand, you can expect some kind of `sed` to be pretty much everywhere. – e0k Jan 31 '16 at 20:37
  • @e0k You can expect some kind of `awk` to be everywhere too and not all seds are GNU sed, just like not all awks are GNU awk. – Ed Morton Jan 31 '16 at 21:28

Good grief. Just use awk:

$ awk -v RS='! Chunk Start' '{$1=$1}NR>1' file
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

The above uses GNU awk for multi-char RS.
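
If you only have a non-GNU awk, a portable sketch of the same idea would be something along these lines (the variable names out and started are just illustrative):

awk '
# a marker line: print the chunk gathered so far (if any) and start a new one
/^! Chunk Start/ { if (out != "") print out; out = ""; started = 1; next }
# once past the first marker, append each line to the current chunk
started { out = (out == "" ? $0 : out " " $0) }
# print the final chunk at end of input
END { if (out != "") print out }
' file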

Ed Morton
  • ["In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away, when a body has been stripped down to its nakedness."](http://english.stackexchange.com/questions/38837/where-does-this-translation-of-saint-exuperys-quote-on-design-come-from). AWK is best at this. – e0k Jan 31 '16 at 21:37
  • Careful - that might imply that brevity is desirable in itself, but it's not; conciseness (brevity with clarity) is what's desirable in software :-). – Ed Morton Jan 31 '16 at 21:45