0

I need a regular expression that matches from <div id="class1"> to its closing </div>. The element may contain any number of nested <div> elements. Here is an example:

This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example

I have tried the code below, but it only matches up to the first </div>, the one that closes <div id="subclass1">. Could anyone help me solve this?

The regex I tried is:

<div id="class1">(?:(?!<\/div>).)*?</div>
siva2012
  • Please don't try to parse HTML with regexes. Regexes are not up to the task. Use an HTML parser. http://htmlparsing.com/perl.html has some examples for Perl. – Andy Lester Dec 08 '12 at 03:15
  • Obligatory link: http://stackoverflow.com/questions/1732348 - Read the answer to this question – Jim Garrison Dec 08 '12 at 04:10
  • As most people have said, there are a lot of HTML/XML modules in Perl, but if you want to feel like you built it yourself, maybe you will like **Parse::RecDescent** – fersarr Dec 08 '12 at 05:02

4 Answers

4

Use a proper HTML parser.

use strict;
use warnings;
use feature qw( say );

use XML::LibXML qw( );

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_html_string($html);
my $root   = $doc->documentElement();

for my $div ($root->findnodes('//div[@id="class1"]')) {
   say "[", $div->toString(), "]";
}
ikegami
0
$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is  This is example
palako
0

You should use an appropriate HTML/XML parser. If you need to do it with a regex for some reason, a recursive regex can help. (See perldoc perlre for details.)

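my $re;   # declared first so the pattern can refer to itself
$re = qr{
  (
    <div[^>]*>                     # opening tag
    (?: (??{ $re }) | [^<>]+ )*    # nested divs or plain text
    </div>                         # matching closing tag
  )
}x;

# Matches against the default variable; prints the first balanced div found
print "$1\n" if /$re/;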
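
The same idea can be sketched with the named recursion syntax available since Perl 5.10, anchored to a specific id. This is only an illustration: extract_div is a hypothetical helper, and it assumes the id is the first attribute of the opening tag and is double-quoted. Unlike the pattern above, it also tolerates non-div tags inside the block.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the balanced <div id="..."> ... </div>
# block for the given id, or undef if no balanced block is found.
sub extract_div {
    my ($html, $id) = @_;
    my $re = qr{
        (                                   # group 1: the whole outer div
            <div \s+ id="\Q$id\E" [^>]* >   # assumed: id comes first, double-quoted
            (?&body)
            </div>
        )
        (?(DEFINE)
            # body: nested divs, plain text, or a "<" that starts a non-div tag
            (?<body> (?: (?&div) | [^<]++ | <(?!/?div\b) )* )
            # div: any balanced div element
            (?<div>  <div\b [^>]* > (?&body) </div> )
        )
    }x;
    return $html =~ $re ? $1 : undef;
}

my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $block = extract_div($html, 'class1');
print defined $block ? "$block\n" : "no match\n";
```

It still breaks on things a parser handles for free, such as comments or attribute values that contain "<div".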
yasu
0

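A lot of people always say "use a proper HTML parser" rather than a regex to parse HTML. What some people fail to realize is that there are requirements to be met, and those requirements might call for a regex.

<div id=".+?">.*</div> should work for you, as long as the closing </div> you want is the last one the pattern can reach (the greedy .* runs to the final </div>).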

http://regexr.com?33336
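
A small sketch of how this greedy pattern behaves, using made-up sample strings (greedy_match is a hypothetical helper, not part of any library). With a single top-level div the match ends where expected; if another div follows, the match runs on to that later </div>.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the greedy pattern from this answer
# and returns the matched text, or undef if nothing matches.
sub greedy_match {
    my ($html) = @_;
    return $html =~ m{(<div id=".+?">.*</div>)} ? $1 : undef;
}

# Made-up inputs: one with a single top-level div, one with a second div after it
my $one   = 'x <div id="class1">a <div id="sub">b</div> c</div> y';
my $extra = $one . ' <div id="other">z</div>';

print greedy_match($one),   "\n";   # stops at the closing tag of class1
print greedy_match($extra), "\n";   # runs on through the closing tag of "other"
```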

Jack