How to extract text between particular HTML tag in script

Question

Given that I have some HTML in the form:

<html>
  <body>
    <div id="1" class="c">some other html stuff</div>
  </body>
</html>

How can I extract the following text from it with a Unix shell script?

some other html stuff

score 3 · Accepted Answer · edited May 23 '17 at 10:30

You can check out the html-xml-utils package and its hxselect command, which lets you extract elements that match a CSS selector:

hxselect '.c' < test.htm

This assumes that your input is a well-formed XML document. If it is not, you might need to resort to regular expressions, with all the consequences that entails.
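As a rough illustration of the regular-expression fallback, the sketch below uses sed. It only works under strong assumptions: the div opens and closes on the same line, contains no nested tags, and carries class="c" exactly as in the question. The function name extract_div_text is made up for this example.

```shell
# Hypothetical helper: read HTML on stdin and print the text inside
# a single-line <div ... class="c">...</div> element.
extract_div_text() {
  # -n suppresses normal output; the trailing p prints only lines
  # where the substitution matched.
  sed -n 's/.*<div[^>]*class="c"[^>]*>\(.*\)<\/div>.*/\1/p'
}

# Feed it the sample document from the question.
extract_div_text <<'EOF'
<html>
  <body>
    <div id="1" class="c">some other html stuff</div>
  </body>
</html>
EOF
```

Anything beyond this trivial layout (multi-line elements, nested divs, different attribute quoting) will break the pattern, which is why a real parser is preferable.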

answered May 29 '12 at 07:06

Darin Dimitrov

You might as well first pass the html code through [hxnormalize](http://www.w3.org/Tools/HTML-XML-utils/man1/hxnormalize.html), which tries to fix small errors in the code. – marlenunez Mar 11 '14 at 13:33

Am I right in assuming that option `-c` has been forgotten? – Cyrus Feb 11 '18 at 20:24
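Putting the two comments together, a pipeline along these lines might be what is wanted. This is a sketch that assumes html-xml-utils is installed; it relies on hxnormalize -x to emit well-formed XML and on hxselect -c to print only the content of the matched element rather than the element with its tags. The function name select_content is made up for this example.

```shell
# Hypothetical helper: normalize HTML from stdin into XML, then print
# only the content of the elements matching the CSS selector in $1.
select_content() {
  hxnormalize -x | hxselect -c "$1"
}

# Run the example only if both tools are available.
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
  select_content '.c' <<'EOF'
<html>
  <body>
    <div id="1" class="c">some other html stuff</div>
  </body>
</html>
EOF
else
  echo 'html-xml-utils is not installed' >&2
fi
```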

score 1 · Answer 2 · edited Apr 13 '17 at 12:51

For simple cases, you can use the ex editor, for example:

$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff

Here, /<div finds the first line containing a div tag, norm vity visually selects the inner content of that tag (vit) and yanks it (y), %d|pu 0 replaces the buffer with the yanked text (delete all lines, then put register 0), %p prints the buffer, and -cq! quits without saving.

Another example, using a demo URL:

$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/

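The advantage is that ex is a standard Unix editor available in most Linux/Unix distributions. Note that the norm command and the vit text object used here are Vim extensions, so this requires the ex that ships with Vim.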
