Hey guys, I'm pretty stuck with this one. I'm supposed to format an XML snippet with sed.

This is the original code snippet:

<input>
    <program_name>
            CS
    </program_name>
    <course_name>
                            ART CLASS
    </course_name>
    <instructor>
                John Smith
    </instructor>
</input>

My sed command should format it into the following:

    <input>
        <program_name>CS</program_name>
        <course_name>ART CLASS</course_name>
        <instructor>John Smith</instructor>
  </input>

So far I have the following:

sed -r 'N;N;s/<([a-z_]+)>( *\n* *)([[a-z]+ ?[a-z]+]+)( *\n* *)(<\1>)/<\1>\3\5/g' question.txt

Unfortunately nothing seemed to change. Any hints or help would be greatly appreciated.

Nick Powers

1 Answer

Disclaimer: Stream editors and regular expressions are not good tools for parsing markup languages such as XML or HTML. In this case we do not have to rely on tag matching, but if you actually need to parse XML or do anything fancy with it in Bash, see: How to parse XML in Bash?


I found enough errors in your original regex that I chose to switch to my own to do what you want:

s/>\s*\n\s*(\w.*\w)\s*\n\s*</>\1</

and here's a demo

Besides the regex typo, you may run into other issues with buffering multiple lines into sed or with overlapping matches. For writing a good multi-line script, you may want to check out this question: How can I replace a newline (\n) using sed?
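To make the buffering concern concrete, here is a rough sketch of one way to wrap the substitution in a complete sed script. It assumes GNU sed (`-E`, `\s` and `\w` are GNU extensions), and the temporary file from `mktemp` is only a stand-in for question.txt. A plain `N;N` at the start of every cycle groups lines in fixed blocks of three, which falls out of step with the tags because of the leading `<input>` line. The sketch instead starts buffering only on a line that holds nothing but an opening tag, and uses `P;D` to slide forward by one line when the next line is another tag.

```shell
# Recreate the input from the question in a temporary file
# (mktemp path is a stand-in for question.txt).
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
<input>
    <program_name>
            CS
    </program_name>
    <course_name>
                            ART CLASS
    </course_name>
    <instructor>
                John Smith
    </instructor>
</input>
EOF

# GNU sed assumed: -E, \s and \w are not portable to every sed.
result=$(sed -E '
# Only act on lines that contain nothing but an opening tag.
/^\s*<[a-z_]+>\s*$/ {
    # Append the next input line to the pattern space.
    N
    # If that line is itself a tag, this was a container such as <input>:
    # print the first line, drop it, and restart with the second line.
    /\n\s*</ {
        P
        D
    }
    # Otherwise it is text, so append the closing-tag line as well.
    N
    # Collapse "tag, text, closing tag" onto a single line.
    s/>\s*\n\s*(\w.*\w)\s*\n\s*</>\1</
}
' "$input_file")

printf '%s\n' "$result"
rm -f "$input_file"
```

This should put each of the three inner elements on one line while leaving the existing indentation of the opening tags untouched, so any re-indenting of the whole snippet would need a separate step. Like the regex above, it expects at least two word characters of text between the tags.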

Will Barnwell
  • Thanks for editing your question. I removed the "fix" I had for your regex because that alone did not solve the regex's problems – Will Barnwell Sep 29 '17 at 01:19