How to match Chinese characters using Perl's regex

Question

I need to match some Chinese characters in a UTF-8 encoded HTML page, and I wrote some test code as below:

#! /usr/bin/perl

use strict;
use LWP::UserAgent;
use Encode;

my $ua = new LWP::UserAgent;

my $request = HTTP::Request->new('GET');
my $url = 'http://www.boc.cn/sourcedb/whpj/';
$request->url($url);

my $res = $ua->request($request) ;

my $str_chinese =   encode("utf8" ,"英磅" ) ;  
# my $str_chinese = "英磅" ;


my $str_english = "English" ;
#my $html = decode("utf8" , $res->content) ;
my $html = $res->content ; 

if ( $html =~ /$str_chinese/ ) {
     print "chinese word matched" ;
}else {
     print "chinese word unmatched\n" ;
}

if ( $html =~ /$str_english/i ) {
    print "english word matched\n" ;
}else {
    print "english word unmatched\n" ;
}

The output shows that the script fails to match the Chinese characters embedded in the HTML. Could you give me a hint on how to solve my problem?

score 7 · Answer 1 · answered Dec 23 '09 at 10:08

Since you have added UTF-8 characters in the source code, you have to:

use utf8;

It tells Perl that your script is written in UTF-8.
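
As a minimal sketch of what the pragma changes (assuming the script file itself is saved as UTF-8, and using 首页 purely as an illustrative literal), you can compare the length of the literal with the length of its UTF-8 encoding:

```perl
use strict;
use warnings;
use utf8;                      # the source file is assumed to be saved as UTF-8
use Encode qw(encode);

# With "use utf8", the literal is a character string: 2 characters.
my $chars = '首页';

# Encoding it yields the byte string that would travel over the wire: 6 bytes.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;
printf "bytes: %d\n", length $bytes;
```

Without use utf8, Perl would read the same literal as six separate bytes, so length $chars would report 6 instead of 2.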

Alan Haggai Alavi

score 4 · Answer 2 · answered Dec 23 '09 at 10:13

I ran your code and the Chinese characters were not matched.

Then I checked the HTML: it does not contain these characters, which may be why nothing matched. I then tried another character (联) and also removed the encode call, i.e. my $str_chinese = "联";

With this change, the character is matched.
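
One likely explanation is that, without use utf8, both the literal and $res->content are raw UTF-8 bytes, so they compare equal byte for byte. The sketch below is an illustration only, not part of the original script: it uses a hard-coded byte string as a stand-in for the downloaded page, and all variable names are made up. It shows which combinations match, and why calling encode on a literal that is already bytes fails.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for $res->content: raw UTF-8 bytes of the character U+8054 (联).
my $bytes_html = "<td>\xE8\x81\x94</td>";

# The same page as a character string, as decoding would produce.
my $chars_html = decode('UTF-8', $bytes_html);

my $needle_chars   = "\x{8054}";                      # one character
my $needle_bytes   = encode('UTF-8', $needle_chars);  # three bytes
my $needle_doubled = encode('UTF-8', $needle_bytes);  # encoded twice: six bytes

# Compare like with like (bytes/bytes, chars/chars), then the mismatched cases.
my $bytes_in_bytes   = $bytes_html =~ /\Q$needle_bytes\E/   ? 1 : 0;
my $chars_in_chars   = $chars_html =~ /\Q$needle_chars\E/   ? 1 : 0;
my $chars_in_bytes   = $bytes_html =~ /\Q$needle_chars\E/   ? 1 : 0;
my $doubled_in_bytes = $bytes_html =~ /\Q$needle_doubled\E/ ? 1 : 0;

printf "%-26s %s\n", 'bytes in bytes:',          $bytes_in_bytes   ? 'matched' : 'unmatched';
printf "%-26s %s\n", 'characters in characters:', $chars_in_chars   ? 'matched' : 'unmatched';
printf "%-26s %s\n", 'characters in bytes:',      $chars_in_bytes   ? 'matched' : 'unmatched';
printf "%-26s %s\n", 'double-encoded in bytes:',  $doubled_in_bytes ? 'matched' : 'unmatched';
```

Only the first two comparisons match. The last case mirrors the question's encode("utf8", "英磅") without use utf8, where bytes that are already UTF-8 get encoded a second time.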

score 3 · Accepted Answer · answered Dec 23 '09 at 13:29

You should use the method decoded_content from the class HTTP::Message instead. Manual decoding is not necessary.

#!/usr/bin/env perl
use utf8;
use strict;
use LWP::UserAgent;

my $html = LWP::UserAgent->new
    ->get('http://www.boc.cn/sourcedb/whpj/')
    ->decoded_content;

my $str_chinese = '首页';
my $str_english = 'English';

if ($html =~ /$str_chinese/) {
    print "chinese word matched\n";
} else {
    print "chinese word unmatched\n";
}

if ($html =~ /$str_english/i) {
    print "english word matched\n";
} else {
    print "english word unmatched\n";
}

Output:

chinese word matched
english word matched
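
To see what decoded_content does without a network connection, here is a small sketch that builds an HTTP::Response by hand. The response body and the Content-Type header are made up for illustration; a real response from the site may declare its charset differently. It also assumes the script file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                      # the source file is assumed to be saved as UTF-8
use Encode qw(encode);
use HTTP::Response;

# Hypothetical response: the body is UTF-8 bytes, as it would arrive from a server.
my $body = encode('UTF-8', '<a href="/">首页</a>');
my $res  = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

my $raw     = $res->content;           # byte string, exactly as received
my $decoded = $res->decoded_content;   # character string, decoded per the charset

# The pattern is a character string because of "use utf8".
my $raw_matched     = $raw     =~ /首页/ ? 1 : 0;
my $decoded_matched = $decoded =~ /首页/ ? 1 : 0;

print "content:         ", ($raw_matched     ? "matched" : "unmatched"), "\n";
print "decoded_content: ", ($decoded_matched ? "matched" : "unmatched"), "\n";
```

Here the character pattern matches only the decoded text, which is why the script above pairs use utf8 with decoded_content rather than content.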

daxim

@daxim: I can't run the above script you provided under Windows; perl complains that there are malformed UTF-8 characters. The editor I use is gvim version 7.2. – Haiyuan Zhang Dec 23 '09 at 16:02

As I wrote earlier, you have to tell gvim to save the file as UTF-8. http://stackoverflow.com/questions/1945221#1945756 – daxim Dec 24 '09 at 12:46