
I have a file with the following contents:

Hi
welcome
! Chunk Start
Line 1
Line2
! Chunk Start
Line 1
Line 2
Line 3
! Chunk Start
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 1
Line 2
Line 3
Line 4

Now, everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk, i.e. the lines between two "! Chunk Start" markers make up a chunk. I need to get the contents of each chunk on a single line, i.e.:

Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

I have done this, but I think there should be a better way. The way I have done this is:

grep -A100 "! Chunk Start" file.txt

The rest of the logic is there to concatenate the lines, but the -A100 is what I am worried about: if a chunk has more than 100 lines, this will fail. I probably need to do this with awk/sed. Please suggest.

Tarun

3 Answers


You can use GNU AWK (gawk). It has a GNU extension that allows the record separator RS to be a regular expression, which we can use to divide the input at each ! Chunk Start. Each line of your "chunks" can then be processed as a field. Some old AWK implementations have a limit on the number of fields, but gawk supports up to MAX_LONG fields. This large number of fields should take care of your worry about chunks with 100+ input lines.

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt

AWK (and GNU AWK) works by dividing the input into records, then dividing each record into fields. Here, we divide records (record separator RS) at the string ! Chunk Start and then divide each record into fields (field separator FS) at each newline \n. You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case what we want happens to be the defaults (ORS="\n" and OFS=" ").
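
For example, spelling those defaults out explicitly changes nothing; it just makes the output format visible:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n";OFS=" ";ORS="\n"}NR>1{$1=$1;print}' infile.txt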

When dividing into records, the part before the first ! Chunk Start will be considered a record. We ignore this using NR>1. I have interpreted your problem specification

everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk

to mean that once ! Chunk Start has been seen, everything else until the end of input belongs in at least some chunk.
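
As a quick check (a throwaway diagnostic, not part of the solution), you can print only that ignored first record to see what it contains:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR==1{printf "%s", $0}' infile.txt
Hi
welcome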

The mysterious $1=$1 forces gawk to rebuild the record $0: assigning to any field makes gawk rejoin all the fields using the output field separator OFS, which replaces the embedded newlines with spaces. The print then outputs this rebuilt record followed by the output record separator ORS.
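
To see why the assignment matters, drop it and compare: without $1=$1 the record is never rebuilt, so each chunk is printed with its newlines still in place rather than joined onto one line:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{print}' infile.txt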

Edit: The version above leaves a trailing space at the end of each line. Thanks to @EdMorton for pointing out that the default field separator FS already splits on whitespace (including newlines), so FS should be left unmodified:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
e0k
  • Thanks for the answer and a very good explanation. The solution is working well. – Tarun Jan 31 '16 at 20:35
  • This version leaves spaces at the end of the lines. (I was hoping either it wouldn't make a difference or no one would notice.) It's an easy thing to fix, but the `sed` answer by @potong filters this out already and is probably the more elegant and efficient solution. – e0k Jan 31 '16 at 20:42
  • You were definitely on the right track. If you hadn't changed `FS` from its default value (which matches all chains of whitespace, including newlines), you wouldn't have created those end-of-line spaces. I think nawk, or maybe just old, broken awk, has a limit on the number of fields, but that's not a concern for standard (POSIX) awks. – Ed Morton Jan 31 '16 at 21:22

This might work for you (GNU sed):

sed '0,/^! Chunk Start/d;:a;$!N;/! Chunk Start/!s/\n/ /;ta;P;d' file

Delete up to and including the first line containing ! Chunk Start. Then gather up lines, replacing each newline with a space. When the next ! Chunk Start (or the end of input) is reached, print the first line of the pattern space, delete the pattern space, and repeat.
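
Spelled out with one command per line and comments (GNU sed allows comments in a script; the behaviour is identical to the one-liner above):

sed '
# delete everything up to and including the first "! Chunk Start" line
0,/^! Chunk Start/d
:a
# unless we are on the last line, append the next input line to the pattern space
$!N
# if no new "! Chunk Start" has been appended, join the lines with a space...
/! Chunk Start/!s/\n/ /
# ...and loop back to :a to keep gathering the current chunk
ta
# otherwise print up to the first newline, i.e. the finished chunk
P
# then delete the pattern space and start the next cycle
d
' file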

potong
  • This is working fine for me. Thanks a lot. Can you please explain the command? – Tarun Jan 31 '16 at 20:29
  • This may be cleaner than my `gawk` version. There also seems to be [no limit on the line length](https://www.gnu.org/software/sed/manual/html_node/Limitations.html) in GNU `sed`. I'm not sure about the portability of this particular `sed` command, but `gawk` is usually something I have to install manually as it doesn't come with all distributions. On the other hand, you can expect some kind of `sed` to be pretty much everywhere. – e0k Jan 31 '16 at 20:37
  • @e0k You can expect some kind of `awk` to be everywhere too and not all seds are GNU sed, just like not all awks are GNU awk. – Ed Morton Jan 31 '16 at 21:28

Good grief. Just use awk:

$ awk -v RS='! Chunk Start' '{$1=$1}NR>1' file
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

The above uses GNU awk for multi-char RS.
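
If you only have a non-GNU awk, a portable sketch of the same idea would be something along these lines (the variable names out and started are just illustrative):

awk '
# a marker line: print the chunk gathered so far (if any) and start a new one
/^! Chunk Start/ { if (out != "") print out; out = ""; started = 1; next }
# once past the first marker, append each line to the current chunk
started { out = (out == "" ? $0 : out " " $0) }
# print the final chunk at end of input
END { if (out != "") print out }
' file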

Ed Morton
  • ["In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away, when a body has been stripped down to its nakedness."](http://english.stackexchange.com/questions/38837/where-does-this-translation-of-saint-exuperys-quote-on-design-come-from). AWK is best at this. – e0k Jan 31 '16 at 21:37
  • Careful - that might imply that brevity is desirable in itself, but it's not; conciseness (brevity with clarity) is what's desirable in software :-). – Ed Morton Jan 31 '16 at 21:45