Given that I have some HTML in the form:

<html>
  <body>
    <div id="1" class="c">some other html stuff</div>
  </body>
</html>

How can I extract the following text with a Unix shell script?

some other html stuff
Benjamin W.
user601836

2 Answers

You may want to check out html-xml-utils and its hxselect command, which extracts the elements that match a CSS selector:

hxselect '.c' < test.htm

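This assumes that your input is a well-formed XML document. If it is not, you may have to fall back on regular expressions, with all the fragility that implies. Note that without the -c option hxselect prints the matched element including its tags; with -c it prints only the element's content.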
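
As a rough illustration of the regular-expression fallback mentioned above, here is a minimal sed sketch. The function name extract_div_c is made up for this example, and the pattern assumes that the whole div sits on a single line, that class="c" is written exactly like that, and that there is only one div on the line. It will break on anything more complicated, which is exactly the risk of parsing HTML with regular expressions.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): print the text inside the
# first single-line <div ... class="c" ...>...</div> in the file "$1".
extract_div_c() {
  # -n suppresses default output; the trailing p prints only lines
  # where the substitution matched. ':' is used as the s delimiter
  # so that the '/' in </div> needs no escaping.
  sed -n 's:.*<div[^>]*class="c"[^>]*>\(.*\)</div>.*:\1:p' "$1" | head -n 1
}

# Sample input copied from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
  <body>
    <div id="1" class="c">some other html stuff</div>
  </body>
</html>
EOF

extract_div_c "$tmp"
rm -f "$tmp"
```

For the sample input from the question this should print the inner text of the div. Nested divs, attributes containing ">", or a div spread over several lines are not handled.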

Darin Dimitrov
  • You might as well first pass the HTML code through [hxnormalize](http://www.w3.org/Tools/HTML-XML-utils/man1/hxnormalize.html), which tries to fix small errors in the code. – marlenunez Mar 11 '14 at 13:33
  • Am I right in assuming that option `-c` has been forgotten? – Cyrus Feb 11 '18 at 20:24

For simple cases, you can use the ex editor, for example:

$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff

This finds the first div tag, selects the inner content of that tag (vit), yanks it (y), replaces the buffer with the yanked text (%delete, put 0), prints it (%print), and quits without saving (-cq!).

Another example, using a demo URL:

$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/

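The advantage is that ex is a standard Unix editor available on most Linux/Unix systems, although the norm command and the vit selection used above are Vim extensions, so this needs the ex that ships with Vim.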

kenorb