3

I want to extract the sentences that lie between each SPAN and br. I am trying to do this with HTML::TreeBuilder, and I am new to Perl. Any help would be appreciated.

<p>
<SPAN class="verse" id="1">1 </SPAN> ଆରମ୍ଭରେ ପରମେଶ୍ବର ଆକାଶ ଓ   ପୃଥିବୀକୁ ସୃଷ୍ଟି କଲେ।
<br><SPAN class="verse" id="2">2 </SPAN> ପୃଥିବୀ ସେତବେେଳେ ସଂପୂରନ୍ଭାବେ ଶୂନ୍ଯ ଓ କିଛି ନଥିଲା। ଜଳଭାଗ ଉପରେ ଅନ୍ଧକାର ଘାଡ଼ଇେେ ରଖିଥିଲା ଏବଂ ପରମେଶ୍ବରଙ୍କର ଆତ୍ମା ଜଳଭାଗ
<br><SPAN class="verse" id="3">3 </SPAN> ଉପରେ ବ୍ଯାପ୍ତ ଥିଲା।
<br><SPAN class="verse" id="4">4 </SPAN> ପରମେଶ୍ବର ଆଲୋକକୁ ଦେଖିଲେ ଏବଂ ସେ ଜାଣିଲେ, ତାହା ଉତ୍ତମ, ଏହାପ ରେ ପରମେଶ୍ବର ଆଲୋକକୁ ଅନ୍ଧକାରରୁ ଅଲଗା କଲେ।
</p>

What I've done so far:

    foreach my $line (@lines)
    {
        # Create a new tree by parsing the HTML in the string $line
        my $tr = HTML::TreeBuilder->new_from_content($line);

        # Find all <p> tags and collect their child nodes into an array.
        my @lists =
              map { $_->content_list }
              $tr->find_by_tag_name('p');

        # Loop through the array and print each value.
        foreach my $val (@lists) {
            print $val, "\n";
            printf FILE1 "\n%s", $val;
        }
    }

I am not able to skip the HTML tags nested inside the p tag. I want to extract only the Unicode text and skip the nested tags.

Vishal Maral

2 Answers

1

I would use XML::Twig, just because I am familiar with it. Under the hood it uses HTML::TreeBuilder to convert HTML to XHTML.

A simple solution to your problem would be this:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

binmode( STDOUT, ':utf8'); # to avoid warnings when printing out wide (multi-byte) characters


my $file= shift @ARGV;

my $t= XML::Twig->new->parsefile_html( $file);

foreach my $p ($t->descendants( 'p'))
  { $p->cut_children( 'span');              # HTML::TreeBuilder lowercases tags
    my @texts= $p->children_text( '#TEXT'); # just get the text
    print join "---\n", @texts;             # or do whatever with the text
  }
mirod
  • Thanks! If I don't know what the child tags are (in this case I know there is only one, the SPAN tag) or how many there are, how can I modify the code above to cut all child tags and keep only the text directly under the parent tag? – Vishal Maral Jan 13 '14 at 04:39
  • Yes, you can do `$p->cut_children( '#ELT');` to cut all nested elements, but then you also cut the `br`
    elements and you will get just one text. Alternatively, you can do `foreach my $child ($p->children( '#ELT')) { $child->cut unless $child->tag eq 'br'; }` to keep the breaks.
    – mirod Jan 13 '14 at 07:58
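
The follow-up question above (keep only the text directly under the parent, whatever the nested tags are) can also be handled with HTML::TreeBuilder alone. In `content_list`, text nodes come back as plain strings and nested elements as HTML::Element objects, so filtering on `ref` skips every nested tag without naming it. What follows is a minimal sketch rather than a definitive solution: the `direct_text` helper and the ASCII sample HTML are made up for illustration, and the input is assumed to be an already decoded string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Needed once real (non-ASCII) verse text is printed.
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical helper: return the text nodes that are direct children
# of each <p>, skipping all nested elements (span, br, ...).
sub direct_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @texts;
    for my $p ( $tree->find_by_tag_name('p') ) {
        for my $node ( $p->content_list ) {
            next if ref $node;           # elements are objects, text nodes are plain strings
            $node =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
            push @texts, $node if length $node;
        }
    }
    $tree->delete;                       # release the parse tree
    return @texts;
}

# Illustrative input shaped like the markup in the question.
my $html = '<p><SPAN class="verse" id="1">1 </SPAN> first verse'
         . '<br><SPAN class="verse" id="2">2 </SPAN> second verse</p>';

print "$_\n" for direct_text($html);
```

Because the filter never looks at tag names, the same helper should work for pages whose nested tags are not known in advance, as long as the wanted text sits directly under the `p` element.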
-1

You could use regexp, of course :-)

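# /i because the source uses <SPAN>; [^<]* stops at the next tag, so the last verse (no trailing <br>) matches too
while ( $html =~ m!<span[^>]*>.*?</span>([^<]*)!gis ){
  my $text = $1;
}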

Fixing the original code is still easy using regexps:

    # And loop through the array returning our values.
    foreach my $val (@lists) {
        $val =~ s!<[^>]*>!!gis;
        print $val, "\n";printf FILE1  "\n%s", $val ;
    }  

Regexp is not evil: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate.

user1126070
  • @user1126070: Thanks. I was able to get what I wanted, but I am looking for a more versatile solution based on HTML parsing that I can use with different webpages to get the text nested between HTML tags. – Vishal Maral Jan 09 '14 at 11:48
  • Regexp is not evil: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html – user1126070 Jan 09 '14 at 14:05
  • @user1126070 It's not evil when used for parsing [regular languages](http://stackoverflow.com/questions/6718202/what-is-a-regular-language), but HTML is not a regular language, and therefore it cannot be parsed by regular expressions. – Kevin Panko Feb 28 '14 at 15:53