I want to extract the sentences that lie between each SPAN and the following br. I am trying to do this with HTML::TreeBuilder, and I am new to Perl. Any help will be appreciated.
<p>
<SPAN class="verse" id="1">1 </SPAN> ଆରମ୍ଭରେ ପରମେଶ୍ବର ଆକାଶ ଓ ପୃଥିବୀକୁ ସୃଷ୍ଟି କଲେ।
<br><SPAN class="verse" id="2">2 </SPAN> ପୃଥିବୀ ସେତବେେଳେ ସଂପୂରନ୍ଭାବେ ଶୂନ୍ଯ ଓ କିଛି ନଥିଲା। ଜଳଭାଗ ଉପରେ ଅନ୍ଧକାର ଘାଡ଼ଇେେ ରଖିଥିଲା ଏବଂ ପରମେଶ୍ବରଙ୍କର ଆତ୍ମା ଜଳଭାଗ
<br><SPAN class="verse" id="3">3 </SPAN> ଉପରେ ବ୍ଯାପ୍ତ ଥିଲା।
<br><SPAN class="verse" id="4">4 </SPAN> ପରମେଶ୍ବର ଆଲୋକକୁ ଦେଖିଲେ ଏବଂ ସେ ଜାଣିଲେ, ତାହା ଉତ୍ତମ, ଏହାପ ରେ ପରମେଶ୍ବର ଆଲୋକକୁ ଅନ୍ଧକାରରୁ ଅଲଗା କଲେ।
</p>
What I've done so far:
foreach my $line (@lines)
{
    # Create a new tree by parsing the HTML held in $line
    my $tr = HTML::TreeBuilder->new_from_content($line);
    # Find all <p> tags and collect their child nodes into an array
    my @lists =
        map { $_->content_list }
        $tr->find_by_tag_name('p');
    # Loop through the array, printing each value to STDOUT and to FILE1
    foreach my $val (@lists) {
        print $val, "\n";
        printf FILE1 "\n%s", $val;
    }
}
I am not able to skip the HTML tags nested inside the p tag. I want to extract only the Unicode text and skip the nested tags.
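One direction that might work, as a minimal sketch rather than a confirmed solution: content_list returns a mix of child elements (HTML::Element objects) and text segments (plain strings), so testing each item with ref may be enough to drop the nested SPAN and br tags. The sample markup and the @verses array below are hypothetical stand-ins for the real page and output. With real Unicode input, the HTML would also need to be decoded before parsing.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample input standing in for the real page.
my $html = <<'HTML';
<p>
<SPAN class="verse" id="1">1 </SPAN> first verse text
<br><SPAN class="verse" id="2">2 </SPAN> second verse text
<br>
</p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @verses;
for my $p ($tree->find_by_tag_name('p')) {
    for my $node ($p->content_list) {
        # Child elements are objects (references); text segments are plain strings.
        next if ref $node;              # skip nested tags such as SPAN and br
        my $text = $node;
        $text =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
        push @verses, $text if length $text;
    }
}

print "$_\n" for @verses;
$tree->delete;                          # free the parse tree
```

Here the verse numbers disappear along with the SPAN elements because they live inside the SPAN, not in the p tag's own text segments.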