Using awk to format text

Question

I'm having a hard time understanding how to achieve what I want using awk, and after searching for quite some time I couldn't find the solution I'm looking for.

I have an input text that looks like this:

Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
 (
Element 4
)
Another line
 (
Element 1, span 1 to 
Element 5, span 4
)
Another Line

I want to properly format the weird lines between ' (' and ')'. The expected output is as follows:

Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line

Searching Stack Overflow, I found this:
How to select lines between two marker patterns which may occur multiple times with awk/sed

So what I'm using now is echo $text | awk '/ \($/{flag=1;next}/\)$/{flag=0}flag'

This almost works, except that it filters out the non-matching lines. Here's the output produced by that last command:

(Element 4)
(Element 1, span 1 to Element 5, span 4)

Does anyone know how to do this? I'm open to any suggestion, including not using awk if you know something better.

Bonus points if you teach me how to turn off syntax highlighting on my question's code blocks :)

Thanks a billion times

Edit: OK, so I accepted @EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using @aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific use case.

I strongly suggest reading EdMorton's explanation; the last paragraph made my day. If anyone passing by has good resources on awk/sed to share, feel free to do so in the comments.

You can use `` to not highlight a block of code. See [syntax-highlighting](http://stackoverflow.com/editing-help#syntax-highlighting). — e0k, Dec 16 '16 at 15:00
So you want to print what is inside the parenthesis marks `()`, but also what is outside? Is the only modification to remove line breaks between the `()`? — e0k, Dec 16 '16 at 15:03
@e0K yes, exactly, and many thanks for the syntax-highlighting trick. Must admit I was too lazy to search for that after so many searches regarding my awk problem :) — daformat, Dec 16 '16 at 15:06
Click the help button (looks like '?'), then "advanced help" it will tell you more about formatting. — e0k, Dec 16 '16 at 15:08

Aaron · Answer 1 · 2016-12-16T17:02:12.833

5

Here's how I would do it with GNU sed :

s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}

Which, for those who don't speak gibberish, means :

remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now starts with an opening bracket. If that's the case, do the following:
- mark this spot as the label l, which denotes the start of a loop
- add a line from the input to the pattern space
- test if you now have a closing bracket in your pattern space
- if so, jump to the label e
- (if not) jump to the label l
- mark this spot as the label e, which denotes the end of the code
- remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)

This can probably be refined, but it does the trick :

$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
 (
Element 4
)
Another line
 (
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'

Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line

Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
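
For readers who want to follow the loop, the second form can be laid out one command per line. This is only a sketch: the wrapper name format_blocks is made up for illustration, [[:space:]] stands in for GNU's \s, and the comments paraphrase the step list above.

```shell
# format_blocks is a hypothetical wrapper name, not part of the original answer.
format_blocks() {
    sed '
        # strip blanks in front of a "(" that opens a line
        s/^[[:space:]]*(/(/
        # only a line that now starts with "(" enters the block
        /^(/ {
            :l
            # append the next input line to the pattern space
            N
            # no ")" collected yet: loop back to the label
            /)/!b l
            # ")" found: drop the embedded newlines
            s/\n//g
        }
    '
}

printf 'Another line\n (\nElement 4\n)\nAnother Line\n' | format_blocks
```

Note that the joined lines get no separator of their own, so the single space in the question's expected output comes from the trailing blank after "span 1 to" in the input.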

edited Dec 16 '16 at 17:02

answered Dec 16 '16 at 15:12


Well, that was fast, it almost works, however I still get a space on the beginning of the lines starting with parenthesis with my input. – daformat Dec 16 '16 at 15:18
@MathieuJouhet I've edited it to remove the leading space – Aaron Dec 16 '16 at 15:21
This is absolutely perfect @Aaron, and you also took the time to explain the gibberish, SUPER helpful :) Do you mind if I wait a little before accepting your answer? I would love to see what other solutions people could suggest before doing so. – daformat Dec 16 '16 at 15:27
Also, @Aaron, if you're willing to, can you explain to me why the first version you gave didn't remove the leading space? (was /^\s*(/{:l N;/)/b e;b l;:e s/\n//g}) – daformat Dec 16 '16 at 15:30
Sure, take your time to accept an answer. My first version just didn't contain any way to remove the leading spaces, while in the second I've added the search/replace `s/^\s*(/(/` which does that. – Aaron Dec 16 '16 at 15:32
My hat's off to you @Aaron, you solved my problem in such a quick and helpful way! However I accepted EdMorton's answer as the question title mentions specifically awk. I edited my question to mention both of you guys. – daformat Dec 17 '16 at 19:54

Ed Morton · Accepted Answer · 2016-12-16T19:16:45.000

sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.

The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:

$ cat tst.awk
BEGIN {
    RS="[(][^)]+[)]"             # a regexp for the string you want to isolate in RT
    ORS=""                       # disable appending of newlines so we print as-is
}
{
    gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line

    sub(/\(\s+/,"(",RT)          # remove spaces after ( in RT
    sub(/\s+\)/,")",RT)          # remove spaces before ) in RT
    gsub(/\s+/," ",RT)           # compress each chain of spaces to one blank char in RT

    print $0 RT                  # print the result
}

$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
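
To make the role of RS and RT more concrete, here is a small gawk-only sketch that prints each record next to the text that terminated it. The function name show_records and the sample input in the usage comment are invented for illustration and are not part of the answer.

```shell
# show_records is a hypothetical helper; it needs GNU awk because RT is a
# gawk extension.
# usage: printf 'a (b) c' | show_records
show_records() {
    gawk '
        BEGIN { RS = "[(][^)]+[)]" }   # same separator regexp as tst.awk
        {
            # $0 holds the text before the match, RT holds the matched "(...)"
            printf "record=[%s] RT=[%s]\n", $0, RT
        }
    '
}
```

Text between matches ends up in $0 and each match ends up in RT; the final record has an empty RT because nothing terminates it. That is why tst.awk can clean up RT on its own and then print $0 RT to put the pieces back together.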

If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...

I absolutely love the last paragraph of your answer. And my question was about using awk so I believe this is probably going to be the accepted answer. — daformat, Dec 17 '16 at 01:43

score 1 · Answer 3 · answered Dec 16 '16 at 15:58

It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:

awk '/^ *\( *$/,/^ *\) *$/ {
        sub(/^ */, "");
        sub(/ *$/, "");
        if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
        if ($0 ~ /\)/) {
            sub(/\( /, "(", hold)
            sub(/ \)/, ")", hold)
            print hold
            hold = ""
        }
        next
     }
     { print }' data

The variable hold is initially empty. The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.

The file data is copy'n'paste from the data in the question.

Output:

Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line

The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
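
The range pattern plus next is what lets the script treat the parenthesised lines specially while passing everything else through. A stripped-down sketch of just that mechanism follows; the name tag_lines and the in[...]/out[...] labels are invented here, and [(] is used instead of \( purely as a matter of taste.

```shell
# tag_lines is a hypothetical helper that only labels lines; it does no joining.
tag_lines() {
    awk '
        # range: starts at a line holding only "(", ends at one holding only ")"
        /^ *[(] *$/, /^ *[)] *$/ { print "in[" $0 "]"; next }
        # reached only for lines outside the range, because of next above
        { print "out[" $0 "]" }
    '
}

printf 'a\n (\nb\n)\nc\n' | tag_lines
```

Both end lines of the range are included in it, which is why the full script has to handle the ( and ) lines itself rather than only the lines between them.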

score 0 · Answer 4 · answered Dec 16 '16 at 15:57

With awk

$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
    gsub(/[\t ]/, "", s)
    return s
}

function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
    print "(" block ")"
}

beg() {
    in_block = 1
    next
}

end() {
    dump_block()
    in_block = block = ""
    next
}

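in_block {
    line = $0
    gsub(/^[\t ]+|[\t ]+$/, "", line)   # trim blanks so lines join with one space
    block = (length(block) > 0 ? block " " : "") line   # no stale sep between blocks
    next
}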

{
    print
}

END {
    if (in_block) dump_block()
}

Usage:

$ awk -f fmt.awk fime.dat
