Extract text between two strings in simple example.html file

Question

I have a very basic HTML file called example.html (see below):

<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>

and I'd like to extract only the fragment below, without simply deleting the first and last three lines.

<div class="research">
    <p>Lorem ipsum...</p>
    <div class="two"></div>
    <div class="three"></div>
    <div class="four"></div>
</div>

I have tried with awk:

cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'

but something seems to be wrong.

I also tried with the body tag (see below):

cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'

(result)

<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>

And that works correctly.

What am I doing wrong?

Thanks in advance.

Yeah! You're right, but the trailing `</div>` tags are still included. So the question is how to select the text up to the proper closing `div` tag? — Egel, Aug 29 '13 at 18:28
You need to count all the matching `<div>` and `</div>` tags. You can't do this with a simple `first,last` pattern; you have to write `awk` code to increment a counter when you see another `<div>`, and decrement it when you see a `</div>`. When the counter goes to 0, you've matched the first one. — Barmar, Aug 29 '13 at 18:29
As an aside, avoid the [Useless Use of `cat`](http://partmaps.org/era/unix/award.html). — tripleee, Aug 29 '13 at 19:32
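
The counting approach Barmar describes could be sketched in awk roughly as follows. This is a minimal sketch, not a general HTML parser: it assumes at most one `<div ...>` or `</div>` tag per line (as in example.html), and the function name `extract_research` is made up for illustration.

```shell
# Hypothetical helper: print everything from the first line containing
# <div class="research"> through its matching </div>.
# Assumption: each line holds at most one opening or closing div tag.
extract_research() {
    awk '
        # Not capturing yet: start at the opening tag of the target div.
        !depth && /<div class="research">/ { depth = 1; print; next }
        depth {
            if (/<div[ >]/) depth++   # a nested div opens
            if (/<\/div>/)  depth--   # a div closes
            print
            if (depth == 0) exit      # the matching close was just printed
        }
    ' "$1"
}
```

Running `extract_research example.html` on the file from the question should print the `research` div through its own closing tag and stop before the outer `</div>`. Note that the file is passed as an argument, so no `cat` is needed.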

score 6 · Accepted Answer · edited May 23 '17 at 12:05

You cannot parse HTML with regular expressions. Assuming the HTML is well-formed XML, you can use:

xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html

<div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>

answered Aug 29 '13 at 18:56

glenn jackman

+1 You cannot parse HTML with regular expressions. (I just like repeating that). – msw Aug 29 '13 at 19:02
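
As for why the range pattern in the question prints nothing while the `<body>` one works: the `^` anchor requires the tag to start at column 0, but the `research` div is indented in example.html, so the range never begins. A small check, assuming the indentation shown in the question (the helper name `count_matches` is hypothetical):

```shell
# Hypothetical helper: count how many lines of a text match a regex.
count_matches() {
    # $1 = regex, $2 = text; "|| true" because grep exits non-zero on no match.
    printf '%s\n' "$2" | grep -c -- "$1" || true
}

# The indented line as it appears in the question's example.html.
line='    <div class="research">'

count_matches '^<div class="research">$' "$line"              # prints 0
count_matches '^[[:space:]]*<div class="research">$' "$line"  # prints 1
```

Loosening the start anchor only fixes where the range begins. The end pattern `^<\/div>$` still matches only the unindented outer `</div>`, which is why the extra closing tag shows up in the output and why tag counting or an XML-aware tool is needed.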
