
I want to modify an HTML file so that it is easier to parse. I need to put each element that follows the <body> tag on a separate line.

For example, my current HTML file is:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    <meta name="ncc:files" content="78" />
  </head>
  <body>
    <h1 class="title" id="h1"><a href="001.smil#txt4">ABOUT DAISY</a></h1>
    <h1 class="section" id="h7">
      <a href="002.smil#txt10">Cover</a>
    </h1>
    <span class="page-normal" id="p13">
      <a href="002.smil#txt15">1</a>
    </span>
    <h1 class="section" id="h18">
      <a href="003.smil#txt21">Swadesaabhimaani, K. Kelappan, Muhammad Abdul Rahiman</a>
    </h1>
    <span class="page-normal" id="p24">
      <a href="003.smil#txt26">2</a>
    </span>
    <span class="page-normal" id="p33">
      <a href="003.smil#txt35">3</a>
    </span>
    <h1 class="section" id="h38">
      <a href="004.smil#txt41">Title</a>
    </h1>
    <span class="page-normal" id="p45">
      <a href="004.smil#txt47">4</a>
    </span>
    <h1 class="section" id="h50">
      <a href="005.smil#txt53">Publication</a>
    </h1>
    <span class="page-normal" id="p69">
      <a href="005.smil#txt71">5</a>
    </span>
    <h1 class="section" id="h74">
      <a href="006.smil#txt77">K. Ramakrishnapilla</a>
    </h1>
      </body>
</html>

The required HTML after the <body> tag is:

<h1 class="title" id="h1"><a href="001.smil#txt4">ABOUT DAISY</a></h1>
<h1 class="section" id="h7"><a href="002.smil#txt10">Cover</a></h1>
<span class="page-normal" id="p13"><a href="002.smil#txt15">1</a></span>

That is, each element and its content must be on a single line, without being split. Please advise how this can be done with sed.

Jens
Anes
  • While it might be possible to do this with `sed` as a super-advanced challenge, you'll do better reviewing answers here on S.O. that use `awk` and set a flag variable to indicate 'inside ' . But see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 . Sooner or later you'll hit problems using `sed` or `awk` on XML(ish) data. You'll need to learn a language that has XML support available. Good luck. – shellter Jan 19 '16 at 14:57

1 Answer


It can be done by first joining all the lines into one, e.g. with `tr -d '\n' < INFILE > OUTFILE` (tr reads standard input, so the file has to be redirected into it).

Then identify the container tags that you want on separate lines and create a sed script for them. For example, for <p> and <h1>:

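#sedscript.sed
# Drop the indentation left between adjacent tags after joining the lines
s/>[[:space:]]*</></g
# Break before each opening tag (attributes allowed) and after each closing tag
s/<h1[ >]/\n&/g
s/<\/h1>/&\n/g
s/<p[ >]/\n&/g
s/<\/p>/&\n/g
# Collapse the empty lines produced where a closing tag meets an opening tag
s/\n\n*/\n/g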

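Then run it with `sed -f sedscript.sed OUTFILE`. Note that `\n` in the replacement text is understood by GNU sed; other sed implementations may need a literal newline instead.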

Although this may suit your needs, it cannot handle malformed HTML (e.g. overlapping tags).
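
As a rough sketch, the two steps above can be combined into one pipeline. This assumes GNU sed (for `\n` in the replacement text) and targets the <h1> and <span> elements from the question; `flatten_tags` is a hypothetical helper name introduced here for illustration only.

```shell
#!/bin/sh
# Sketch only: put each <h1> and <span> element on its own line.
# Assumes GNU sed; flatten_tags is a made-up name, not an existing tool.
flatten_tags() {
    # Step 1: join all lines of the input file into a single line.
    tr -d '\n' < "$1" |
    # Step 2: strip whitespace between tags, break around the chosen
    # elements, then collapse the doubled newlines this creates.
    sed -e 's/>[[:space:]]*</></g' \
        -e 's/<h1[ >]/\n&/g'   -e 's/<\/h1>/&\n/g' \
        -e 's/<span[ >]/\n&/g' -e 's/<\/span>/&\n/g' \
        -e 's/\n\n*/\n/g'
}

# Usage: flatten_tags INFILE > OUTFILE
```

With the file from the question, everything up to and including <body> ends up on the first output line and </body></html> on the last, so those two lines may need to be skipped when parsing.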

Zsolt Botykai