I need a way to remove nodes like <footer>foobar</footer> and <div class="nav"></div> from a large number of HTML files.

I want to dump a site to disk without the menus, footers and whatnot. Ideally I would accomplish this using basic Unix tools like sed. Since the pages are HTML rather than XML, I can't use xmlstarlet.

Could anyone suggest recipes? Ideally I would end up with a script that I can run as kill-node.sh 'div class="toplinks"' *.html to prune the bits I don't want. Thank you,

hendry
  • HTML vs regex is going to trigger some gut reactions, so you might want to give some more information. Is this a long-lived solution across a large variety of files or more a one-shot deal across a limited set of files? Is there a lot of variation in how the target nodes are formatted across the files or are they identical? If they are identical, can you be more specific as to how they are laid out in the files? Can we modify the entire file with an [X]HTML normalizer first or are we strictly limited to removing the target nodes? – Bert F May 03 '10 at 11:35
  • One-shot. Near identical. I wish I knew how to remove an identical 30-line block of text from *.html. :) [X]HTML normalizer... you mean `tidy`? I don't like tidy since it doesn't do HTML5, and it takes at least half an hour of switch madness to get it outputting something sane. – hendry May 03 '10 at 11:49

2 Answers

sed is based on regular expressions. Parsing HTML with regular expressions is a topic that comes up over and over again here on SO; see e.g. "regular expression to extract text from HTML" or, even better, "Can you provide some examples of why it is hard to parse XML and HTML with a regex?".

That said, if the HTML pages are written in a similar way, you may still be able to construct a regexp that does the job. Be prepared, though: it is impossible (yes, provably impossible in theory) to build a complete solution that works in all cases using regexps.
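
For the near-identical, one-shot case described in the comments, a line-range delete is one possible sketch. The function name strip_lines is made up for illustration. It assumes the opening and closing tags each sit on their own line, with nothing worth keeping on those lines, and that the element is not nested inside another element of the same name:

```shell
#!/bin/sh
# strip_lines TAG [FILE...]  (hypothetical helper, not from the original post)
# Deletes every line from one containing <TAG> or <TAG ...> through the next
# line containing </TAG>. Assumes the two tags are on separate lines.
strip_lines() {
    tag=$1
    shift
    sed "/<$tag[ >]/,/<\/$tag>/d" "$@"
}
```

If the closing tag shares a line with the opening tag, sed only starts looking for the end of the range on the following line, so the delete would run on to the next closing tag or to the end of the file. Something like strip_lines footer page.html > page.stripped.html would be the intended use.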

Anders Abel
  • In my case, matching the start and end tag should be straightforward. Nonetheless, if you can suggest a better, saner command-line tool, I'm all ears! – hendry May 03 '10 at 11:22
  • @hendry The <center> can not hold, it's too late! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Tim Post May 03 '10 at 11:41
Just to drive you regex haters nuts, try this on for size:

sed ':a;$!N;$!ba;s/-/-d/g;s/A/-a/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/A/<\/foo>/g;s/-a/A/g;s/-d/-/g' foo.html

With foo.html being:

<header>
keep me
<foo>gtg</foo>
</header>
<foo>
delete me</foo>
<foo>gtg</foo>
<foo>gtg</foo>
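
The same trick (slurp the whole file, swap the closing tag for a single marker character, then delete from the opening tag to the nearest marker) can be wrapped into something closer to the kill-node.sh from the question. This is only a sketch: the function name kill_node and the choice of byte 0x01 as the marker are my own assumptions. It also assumes the target element contains no nested element of the same tag, the input contains no 0x01 byte, and the selector contains no regex metacharacters or '#'.

```shell
#!/bin/sh
# kill_node SELECTOR [FILE]  (hypothetical helper, not from the original post)
# Prints FILE (or stdin) with every <SELECTOR>...</TAG> element removed,
# where TAG is the first word of SELECTOR.
kill_node() {
    selector=$1
    shift
    tag=${selector%% *}        # 'div class="nav"' -> 'div'
    mark=$(printf '\001')      # marker that stands in for the closing tag

    # 1. Read the whole input into the pattern space.
    # 2. Replace each </TAG> with the marker.
    # 3. Repeatedly delete <SELECTOR> up to the nearest marker.
    # 4. Turn any leftover markers back into </TAG>.
    sed -e ':a' -e '$!N' -e '$!ba' \
        -e "s#</$tag>#$mark#g" \
        -e ':b' -e "s#<$selector>[^$mark]*$mark##" -e 'tb' \
        -e "s#$mark#</$tag>#g" "$@"
}
```

Run on one file at a time, for example kill_node 'div class="toplinks"' page.html > page.pruned.html. Passing several files in one call would make sed treat them as a single stream.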

Otherwise, could someone please write a command-line HTML5 parser? Thanks. x

hendry