I need some suggestions on parsing HTML content. I need to extract the id of each <a> tag inside a div and store it in a specific variable. I have tried to write a regular expression for this, but it picks up the ids of the <a> tags in every div. I need to store only the ids of the <a> tags that are inside one specific div.

The HTML content is

<div class="m_categories" id="part_one">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10018">aaa</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10007">bbb</a>
</li>
.
.
.
</div>

<div class="m_categories hidden" id="part_two">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10016">ccc</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10011">ddd</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10025">eee</a>
</li>
.
.
</div>

Any suggestions would be appreciated. Thanks in advance.

Update: the regexes I have used:

if($content=~m/sel_cat " id="([^<]*?)"/is){}

while($content=~m/sel_cat " id="([^<]*?)"/igs){}

Balakumar
  • A proper html parser would be easier I think. If you still want to use regex... post the regex you've been trying. – Jerry Aug 30 '13 at 19:20
  • Obligatory: [You can't parse \[X\]HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). "Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. " – DVK Aug 30 '13 at 19:23

2 Answers

You should really look into HTML::Parser rather than trying to use a regex to extract bits of HTML.

One way to use it to extract the id attribute from each div tag would be:

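use strict;
use warnings;
use HTML::Parser;

# This handler only looks at opening tags
sub start_handler {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'div') {       # is it a div element?
        if ($attr->{id}) {         # does the div have an id?
            print "div id found: ", $attr->{id}, "\n";
        }
    }
}
my $html = read_html_somehow() or die $!;   # placeholder: load the HTML however you like

my $p = HTML::Parser->new(api_version => 3);
# The argspec string determines which arguments the handler receives
$p->handler( start => \&start_handler, 'self, tagname, attr, attrseq, text' );
$p->parse($html);
$p->eof;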

This is a lot more robust and flexible than a regex-based approach.
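
To get from div ids to what the question actually asks for (the ids of the <a> tags inside one particular div), the same handler idea can be extended with a small nesting counter. The following is only a sketch under some assumptions: the helper name anchor_ids_in_div is made up for illustration, and the sample markup is an abbreviated copy of the question's HTML with the missing closing tags added.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample markup modelled on the question (abbreviated, closing tags added).
my $html = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10007">bbb</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10025">eee</a></li>
</ul>
</div>
HTML

# Hypothetical helper: collect the ids of <a> tags inside the div with the given id.
sub anchor_ids_in_div {
    my ($html, $div_id) = @_;
    my $depth = 0;    # 0 = outside the target div, otherwise the div nesting level
    my @ids;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ($tagname, $attr) = @_;
            if ($tagname eq 'div') {
                if ($depth > 0) {
                    $depth++;       # nested div inside the target div
                }
                elsif (defined $attr->{id} && $attr->{id} eq $div_id) {
                    $depth = 1;     # entering the target div
                }
            }
            elsif ($tagname eq 'a' && $depth > 0 && defined $attr->{id}) {
                push @ids, $attr->{id};
            }
        }, 'tagname, attr' ],
        end_h       => [ sub {
            my ($tagname) = @_;
            # Leaving a div while inside the target: step one level out
            $depth-- if $tagname eq 'div' && $depth > 0;
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @ids;
}

my @part_two_ids = anchor_ids_in_div($html, 'part_two');
print "$_\n" for @part_two_ids;
```

The counter matters only if the target div can contain other divs; for flat markup like the sample, a simple in/out flag would do the same job.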

smocking

There are many great HTML parsers around. I kind of like the Mojo suite, which allows me to use CSS selectors to get a part of the DOM:

use Mojo::DOM;
use feature 'say';

my $dom = Mojo::DOM->new($html_content);

say for $dom->find('a.sel_cat')->all_text;
# Or, more robust:
# say $_->all_text for $dom->find('a.sel_cat')->each;

Output:

aaa
bbb
ccc
ddd
eee

Or for the IDs:

say for $dom->find('a.sel_cat')->attr('id');
# Or, more robust:
# say $_->attr('id') for $dom->find('a.sel_cat')->each;

Output:

sel_cat_10018
sel_cat_10007
sel_cat_10016
sel_cat_10011
sel_cat_10025

If you only want those ids in the part_two div, use the selector #part_two a.sel_cat.
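
Put together, a self-contained version for the part_two case might look like the sketch below. It assumes Mojolicious is installed; the sample markup is an abbreviated copy of the question's HTML with closing tags added, and the variable names are only illustrative. It iterates with each, so it also copes with an empty result.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup modelled on the question (abbreviated, closing tags added).
my $html_content = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10007">bbb</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10025">eee</a></li>
</ul>
</div>
HTML

my $dom = Mojo::DOM->new($html_content);

# Descendant selector: only a.sel_cat elements somewhere inside #part_two
my @part_two_ids = map { $_->attr('id') } $dom->find('#part_two a.sel_cat')->each;

print "$_\n" for @part_two_ids;
```

Storing the ids in an array first, rather than printing them directly, matches the question's goal of keeping them in a variable for later use.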

amon
  • Thanks @Amon, but I am getting the error `can't locate object method "all_text" via package "Mojo::Collection "` even though I have already installed the package. How can I solve this? – Balakumar Aug 31 '13 at 17:33
  • @Balakumar Here you go. There was a silly typo (*car* instead of *cat*), which had the query return an empty collection. I corrected that, and also added versions that don't have problems with empty results. – amon Aug 31 '13 at 17:43