Perl regexp to find an element inside an element

Question

I need a regular expression that matches from <div id="class1"> to its closing </div>. The element may contain any number of nested <div> elements in its text. Here is an example:

This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example

I have tried the pattern below, but it matches only up to the first </div>, the one that closes <div id="subclass1">. Could anyone help me solve this?

The pattern I tried is:

<div id="class1">(?:(?!<\/div>).)*?</div>
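To see why this pattern stops early, here is a minimal sketch that runs it against the example string. The variable names `$html` and `$match` are illustrative and not part of the original post.

```perl
use strict;
use warnings;

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';

# The tempered dot (?:(?!</div>).)*? can never step over a "</div>",
# so the match has to end at the first closing tag after the opening one.
my ($match) = $html =~ m{(<div id="class1">(?:(?!</div>).)*?</div>)};

print "$match\n";
# prints: <div id="class1">This is <div id="subclass1">This is </div>
```

The pattern has no notion of nesting depth, so it cannot tell which </div> belongs to the outer element.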

Please don't try to parse HTML with regexes. Regexes are not up to the task. Use an HTML parser. http://htmlparsing.com/perl.html has some examples for Perl. — Andy Lester, Dec 08 '12 at 03:15
Obligatory link: http://stackoverflow.com/questions/1732348 - Read the answer to this question — Jim Garrison, Dec 08 '12 at 04:10
As most people have said, there are plenty of HTML/XML modules for Perl, but if you want to feel like you built it yourself, maybe you will like **Parse::RecDescent** — fersarr, Dec 08 '12 at 05:02

score 4 · Accepted Answer · answered Dec 08 '12 at 04:54

Use a proper HTML parser.

use strict;
use warnings;
use feature qw( say );

use XML::LibXML qw( );

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_html_string($html);
my $root   = $doc->documentElement();

for my $div ($root->findnodes('//div[@id="class1"]')) {
   say "[", $div->toString(), "]";
}

ikegami

Thanks for your source code. Is it possible this through regular expression – siva2012 Dec 10 '12 at 06:25
Sure, wrap the whole thing with `'' =~ /(?{ ... })/;` – ikegami Dec 10 '12 at 06:56

score 0 · Answer 2 · answered Dec 08 '12 at 02:54

$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is  This is example

score 0 · Answer 3 · answered Dec 08 '12 at 03:34

You should use an appropriate HTML/XML parser. If you need to do it with a regex for some reason, a recursive (nested) regex can help. (See perldoc perlre for details.)

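my $re;   # declared first so the pattern can refer to itself
$re = qr{
  (
    <div[^>]*>                   # opening tag
    (?:(??{$re}) | [^<>]*)*      # nested div elements or plain text
    </div>                       # matching closing tag
  )
}x;

# Matches against the current topic variable
print "$1\n" if /$re/;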
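As a self-contained variant of the same idea, the sketch below uses named recursion (`(?&name)` with a `(?(DEFINE)...)` block, both documented in perlre) rather than the `(??{ ... })` construct above, and it anchors the outer match on id="class1". The helper `outer_div` and the group names `div` and `content` are hypothetical, chosen for illustration. It assumes the element contains only text and nested <div> elements; any other tag inside would make the match fail.

```perl
use strict;
use warnings;

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';

my $re = qr{
    ( <div \s+ id="class1"> (?&content) </div> )   # capture 1: the whole outer element
    (?(DEFINE)
        # any div element, which may contain further div elements
        (?<div>     <div\b[^>]*> (?&content) </div> )
        # element body: nested div elements or runs of plain text
        (?<content> (?: (?&div) | [^<]++ )* )
    )
}x;

# Hypothetical helper: returns the outer element, or undef if there is no match.
sub outer_div {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $found = outer_div($html);
print defined $found ? "$found\n" : "no match\n";
```

Because every nested opening tag has to be paired with its own closing tag before the outer </div> is accepted, the match ends at the correct closing tag even when further <div> elements follow it.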

score 0 · Answer 4 · answered Dec 09 '12 at 12:21

People often say "use a proper HTML parser" rather than a regex to parse HTML. What some fail to realize is that there are requirements to be met, and those requirements might call for a regex.

<div id=".+?">.*</div> should work for you.

http://regexr.com?33336
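A quick check of this pattern, as a sketch: the greedy `.*` runs to the last </div> in the string, so the result depends on the target element being the final <div> in the input. The second string below, with an extra `<div id="other">` appended, is a hypothetical input that is not from the question.

```perl
use strict;
use warnings;

my $re = qr{<div id=".+?">.*</div>};

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';

# On the question's example the last </div> is the outer one,
# so the match is exactly the outer element.
my ($m1) = $html =~ /($re)/;
print "$m1\n";

# Hypothetical input: the same text followed by one more div.
my $extended = $html . ' <div id="other">More</div>';

# Here the greedy .* runs on to the final </div>, past the outer element.
my ($m2) = $extended =~ /($re)/;
print "$m2\n";
```

If other <div> elements can follow the target, a pattern that tracks nesting, or an HTML parser, is the safer choice.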
