Given that I have some HTML in the form:
<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>
How can I extract this with a Unix script?
some other html stuff
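You can check out html-xml-utils and its hxselect command, which extracts elements that match a CSS selector:
hxselect -c '.c' < test.htm
The -c option prints only the content of the matched element; without it, the surrounding div tags are printed as well.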
This assumes that your input is a well-formed XML document. If it is not, you may have to resort to regular expressions, with all the fragility that implies.
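For what it's worth, here is a minimal sketch of what such a regex fallback could look like with plain sed. The function name extract_div_text is made up for illustration, and it assumes the div opens and closes on a single line with no nested div inside it, as in the sample from the question. Anything more complicated will break it.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): prints the text between
# <div ...> and </div> for each line that contains both.
# Assumption: the whole div sits on one line and contains no nested div.
extract_div_text() {
  sed -n 's:.*<div[^>]*>\(.*\)</div>.*:\1:p' "$@"
}

# Demo on the sample from the question, fed on standard input.
extract_div_text <<'EOF'
<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>
EOF
# prints: some other html stuff
```

The function reads standard input when called without arguments, so it can also be used as extract_div_text < test.htm or extract_div_text test.htm.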
For simple cases, you can use the ex editor, for example:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff
Here it finds the div tag, selects the tag's inner HTML (vit), yanks it (y), replaces the buffer with the yanked text (%delete, put 0), prints it (%print), and quits (-cq!).
Another example, using a demo URL:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/
The advantage is that ex is a standard Unix editor available on most Linux/Unix distributions.