
I am developing a web crawler in Perl. It extracts the content of each page and then runs a pattern match against a range of Unicode code points to check which language the content is written in.

Sometimes the extracted content contains text in several languages. When the pattern matches, my code prints all of the text, but I want to print only the text that matches the Unicode code points specified in the pattern.

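use LWP::UserAgent;
use HTML::ContentExtractor;

# $url is assumed to hold the address of the page being crawled
my $uu         = LWP::UserAgent->new(agent => 'Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();

# fetch the page and decode the response body into Perl characters
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();

# character class covering the code points U+0C00 to U+0C7F
my $range = qr/([\x{0C00}-\x{0C7F}]+)/;

binmode(STDOUT, ":utf8");

# the match only tests the page; the whole extracted text is still printed
if ($contentss =~ $range) {
  $extractorr->extract($url, $contentss);
  print "$url\n";
  print $extractorr->as_text;
}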
Borodin
Nagaraju

1 Answer


It would be better to match characters with a particular Unicode property, rather than trying to formulate an appropriate character class.

The code points in the range 0x0C00...0x0C7F correspond to characters in Telugu (one of the Indian languages) which you can match using the regex /\p{Telugu}/.

The other properties you will probably need are /\p{Kannada}/, /\p{Malayalam}/, /\p{Devanagari}/, and /\p{Tamil}/.
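To print only the matching text rather than the whole page, one option is to collect every run of Telugu characters with a global match. This is a minimal sketch, not the asker's crawler: the sample string and the names `$telugu_word`, `$text`, and `@runs` are made up for illustration, and in the real program `$text` would be the extracted page text.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: English text containing one word in Telugu script.
my $telugu_word = "\x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41}";
my $text        = "Language: $telugu_word (Telugu), spoken in India";

# In list context, /g returns the capture from every match,
# that is, every maximal run of Telugu-script characters.
my @runs = $text =~ /(\p{Telugu}+)/g;

print "$_\n" for @runs;
```

Spaces and punctuation do not belong to the Telugu script, so each word comes back as a separate run. If whole sentences are wanted, the pattern would need to allow whitespace between Telugu characters as well.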

Borodin
  • Are you still using the `$range` variable? I expected that you would write just `if ($contentss =~ /(\p{Telugu}+)/) {...}`. If you want to put the regex into a variable, then you must remove the square brackets (as they contain just a list of characters, and you can't put Unicode properties inside) and use single quotes instead of double (as otherwise the backslash will be swallowed). So `my $range = '(\p{Telugu}+)'`. – Borodin Oct 03 '13 at 12:00
  • I am not using the `$range` variable. I tried `my $range = '(\p{Telugu}+)'` but I get the same result. – Nagaraju Oct 07 '13 at 04:41
  • Try using `\p{InTelugu}` instead. If that doesn't work then you need to show your code. – Borodin Oct 07 '13 at 07:38
  • It worked when I did this: `@cont = split(/\n/,$extractorr->as_text); print "@_\n"; foreach $cont (@cont) { if($cont =~ m/\p{Telugu}/) { binmode(STDOUT, ":utf8"); print "$cont\n"; } }` – Nagaraju Oct 09 '13 at 06:28
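The line-by-line approach from the last comment can be written more compactly with `grep`. This is a sketch under assumptions: `$text` is a hypothetical stand-in for the output of `$extractorr->as_text`, and `@telugu_lines` is a name introduced here for illustration.

```perl
use strict;
use warnings;

# Set the output layer once, before any printing.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in for the extracted page text:
# three lines, of which only the second contains Telugu.
my $text = join "\n",
    'An English line',
    "\x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} wiki",
    'Another English line';

# Keep only the lines that contain at least one Telugu-script character.
my @telugu_lines = grep { /\p{Telugu}/ } split /\n/, $text;

print "$_\n" for @telugu_lines;
```

Note that this keeps each matching line whole, so any non-Telugu text on the same line (here the word "wiki") is printed too.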