I was writing a web crawler using PERL, and I realized there was a weird behavior when I try to display string using HTML::Entities::decode_entities.
I was handling strings that contain contain Chinese characters and strings like Jìngyè. I used HTML::Entities::decode_entities to decode chinese characters, which works well. However, when the string contain no Chinese characters, the string displayed weirdly (J�ngy�).
I wrote a small code to test different behaviors on 2 strings.
String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 was "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".
Below is my code:
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."號");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
These are my results:
before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466
decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)
chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)
before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號
decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)
chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)
Can someone please explain me why was this happening? And how to solve this so that my String will display properly.
Thank you very much.
Sorry, I did not make my question clear, below is the code I wrote, where URL is http://maps.google.com/maps/place?cid=10931902633578573013:
sub getInfoURLs {
my ($url) = @_;
unless (defined $url){
print 'URL was not defined when extracting info\n';
return 0;
}
my $contain_request = LWP::UserAgent->new->get($url);
if($contain_request -> is_success){
my $contain_content = $contain_request -> decoded_content;
#store address
if ($contain_content =~ m/$address_pattern/i){
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."號");
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
#unicode conversion
#store in database
}
}
}