
I'm trying to parse an HTML file and I want to extract everything inside an outer div tag with a unique id. Sample:

<body>
  ...
  <div id="1">

    <div id="2">
    ...
    </div>

    <div id="3">
    ...
    </div>

  </div>
  ...
</body>

Here I want to extract everything between <div id="1"> and its matching closing </div> tag, NOT the first </div> tag that follows it.

I've gone through many older posts, but their solutions don't work: they stop at the first </div> tag, which is not what I'm looking for.

Any pointers would be appreciated.

gameover

2 Answers


It sounds like your problem is that you are trying to parse HTML using regular expressions.

Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
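
As a rough illustration of the parser approach, here is a minimal sketch using HTML::TreeBuilder::XPath. The sample markup and the variable names (`$inner_html`, `$plain_text`) are made up for the example, and the sketch assumes the usual HTML::Element methods (`content_list`, `as_HTML`, `as_text`) behave as documented. It selects the outer div as a node rather than a string, then builds both the inner markup and a text version with a separator between the pieces.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input, modelled on the question
my $html = <<'END';
<body>
  <div id="1">
    Under div id 1
    <div id="2">Under div id 2</div>
    <div id="3">Under div id 3</div>
  </div>
  <p>Outside the divs</p>
</body>
END

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# findnodes returns element nodes; take the first match
my ($div) = $tree->findnodes('//div[@id="1"]')
    or die "no element with id 1\n";

# Inner markup: serialise each child (child elements are objects,
# text children are plain strings)
my $inner_html = join '',
    map { ref($_) ? $_->as_HTML : $_ } $div->content_list;

# Text with a separator between the direct children; text nested more
# deeply inside a single child is still concatenated by as_text
my @pieces;
for my $child ($div->content_list) {
    my $piece = ref($child) ? $child->as_text : $child;
    $piece =~ s/^\s+|\s+$//g;
    push @pieces, $piece if length $piece;
}
my $plain_text = join "\n", @pieces;

print "$inner_html\n---\n$plain_text\n";

$tree->delete;    # release the tree explicitly
```

Because the parser tracks nesting, the selection ends at the closing tag that matches `<div id="1">`, not at the first `</div>` it meets.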

Quentin
  • Hi. Thanks, it works, but the text returned by `$tree->findvalue( '//*[@id="container"]');` has words glued together, with no space between them. Do you know the fix? – gameover Jan 16 '13 at 14:13
  • Basically, it joins the HTML text lines to one another without any separator. – gameover Jan 16 '13 at 14:14
  • It would help if you updated your question with the code you're trying to use. – Craig Treptow Jan 16 '13 at 14:24
  • @gameover: use `find` to get nodes instead of a string. Then you should be able to get an HTML representation of the node you get (I don't know the methods for that off the top of my head). – Quentin Jan 16 '13 at 14:26

Quentin has rightly mentioned using an HTML parser to extract div content. Here's one option using Mojo::DOM:

use strict;
use warnings;
use Mojo::DOM;

my $text = <<END;
<body>
  ...
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>

    <div id="3">
Under div id 3
    </div>

  </div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($text);

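# at() returns the first element matching the selector; text() returns only
# the text directly inside it, not the text of the nested divs
print $dom->at('div[id="1"]')->text, "\n";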

Output:

Under div id 1
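
To get everything inside the outer div, nested divs included, something along these lines may be closer to what the question asks for. This is only a sketch: it assumes a reasonably recent Mojolicious release in which Mojo::DOM provides `content` (the inner markup) and `all_text` (the text of the element and all of its descendants), and the sample markup and variable names are invented for the example.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input, modelled on the question
my $html = <<'END';
<body>
  <div id="1">
    Under div id 1
    <div id="2">Under div id 2</div>
    <div id="3">Under div id 3</div>
  </div>
  <p>Outside the divs</p>
</body>
END

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector, or undef
my $div = $dom->at('div[id="1"]')
    or die "no element with id 1\n";

my $inner_html = $div->content;     # markup between the div's own tags
my $all_text   = $div->all_text;    # text of the div and its descendants

print "$inner_html\n---\n$all_text\n";
```

How much whitespace `all_text` keeps differs between Mojolicious versions, so the text may need trimming or collapsing before use.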
Kenosis