I have a very basic HTML file called example.html (see below):

<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>

and I'd like to extract only the following fragment, but without simply removing the first and last 3 lines:

<div class="research">
    <p>Lorem ipsum...</p>
    <div class="two"></div>
    <div class="three"></div>
    <div class="four"></div>
</div>

I have tried with awk:

cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'

but something seems to be wrong.

I also tried with body tag (see below)

cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'

(result)

<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>

And it's working correctly.

What am I doing wrong?

Thanks in advance.

Egel
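  • `/^<div class="research">$/` doesn't work because `<div class="research">` is indented, so it doesn't start at the beginning of the line. – Barmar Aug 29 '13 at 18:21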
  • Yeah! You're right, but the closing `</div>` tags are still in play. So the question is how to select text up to the proper closing `div` tag? – Egel Aug 29 '13 at 18:28
  • You need to count all the matching `<div>` and `</div>` tags. You can't do this with a simple `first,last` pattern; you have to write `awk` code to increment a counter when you see another `<div>`, and decrement it when you see a `</div>`. When the counter goes to 0, you've matched the first one. – Barmar Aug 29 '13 at 18:29
  • As an aside, avoid the [Useless Use of `cat`](http://partmaps.org/era/unix/award.html). – tripleee Aug 29 '13 at 19:32
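
The counting approach Barmar describes in the comments above could be sketched roughly as follows. This is only an illustration, not code posted by anyone in the thread: the function name `extract_research_div` is made up here, and the sketch assumes the markup is as regular as in the question (the target tag written exactly as `<div class="research">`, no `div` tags inside comments or attribute values). It is still pattern matching rather than real HTML parsing, so it will break on less tidy input.

```shell
# Hypothetical helper: print the first <div class="research"> block in a file,
# up to and including its matching </div>.
extract_research_div() {
  awk '
    # Start printing at the line that opens the target div (indentation is allowed).
    /<div class="research">/ { inside = 1 }
    inside {
      print
      line = $0
      depth += gsub(/<div[ >]/, "", line)   # each opening <div ...> goes one level deeper
      depth -= gsub(/<\/div>/, "", line)    # each closing </div> comes back up one level
      if (depth == 0) exit                  # the div that started the block is now closed
    }
  ' "$1"
}
```

Calling `extract_research_div example.html` on the file from the question should print the lines from `<div class="research">` through its matching `</div>`, with their original indentation. Because `gsub` returns the number of replacements it made, it doubles as a per-line tag counter here; the copy in `line` keeps the printed text untouched.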

1 Answer

You cannot parse HTML with regular expressions. Assuming the HTML is valid XML, you can use:

xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html  
<div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>
glenn jackman
  • +1 You cannot parse HTML with regular expressions. (I just like repeating that). – msw Aug 29 '13 at 19:02