
I was writing a web crawler in Perl and noticed odd behavior when displaying strings decoded with HTML::Entities::decode_entities.

I was handling strings that contain Chinese characters as well as strings like Jìngyè. I used HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when the string contains no Chinese characters, it is displayed incorrectly (J�ngy�).

I wrote a short test to compare the behavior on two strings.

String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".

Below is my code:

print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";

These are my results:

before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466

decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)

chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)

before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號

decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)

chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)

Can someone please explain why this is happening, and how to fix it so that my strings display properly?

Thank you very much.

Sorry, I did not make my question clear. Below is the code I wrote, where the URL is http://maps.google.com/maps/place?cid=10931902633578573013:

sub getInfoURLs {
my ($url) = @_;
unless (defined $url){
    print "URL was not defined when extracting info\n";
    return 0;
}

my $contain_request = LWP::UserAgent->new->get($url);
if($contain_request -> is_success){
    my $contain_content = $contain_request -> decoded_content;

    #store address
    if ($contain_content =~ m/$address_pattern/i){
        print "before: $1\n";
        my $decoded = HTML::Entities::decode_entities($1."&#34399");
        print "decoded $decoded\n";
        my $chopped = substr($decoded, 0, -1);
        print "chopped: $chopped\n";
        #unicode conversion
        #store in database            
    }
 }
}
hook38
  • Have you looked at [this question and its answer](http://stackoverflow.com/questions/2725893/how-do-i-split-chinese-characters-one-by-one)? – MarcoS Sep 12 '11 at 10:06
  • What does your input with entities look like? Your examples display with invalid characters in boxes for me; perhaps that's all the explanation there is? Perhaps you need to type them in like &entity; in order for the markup to not eat them. – tripleee Sep 12 '11 at 11:38
  • Sorry, � in J�ngy� is already a replacement character, you broke the string by treating it inappropriately. Show the full code so we can reproduce the problem on our own. It is especially interesting how `$1` is filled and what it looks like as dumped with [Devel::Peek](http://p3rl.org/Devel::Peek)::Dump(). – daxim Sep 12 '11 at 13:50
  • Sorry, the information was extracted from a web site using LWP->decoded_content, and $1 was captured with a simple regular expression. – hook38 Sep 12 '11 at 14:14

1 Answer


First, always use `use strict; use warnings;`!

The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.

Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F) is obviously not a byte, so:

$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號

But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so:

$ perl -we'print "\xE8\n"'
�

The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.

Encode's encode is used to provide character encoding. You can also have encode called automatically using the open module.

$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號

$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è
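
The same effect can be had by calling Encode's encode explicitly. Below is a minimal sketch applied to the "chopped" case from the question; the literal string is a hypothetical stand-in for the decoded address, and it assumes the terminal expects UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the decoded address: "Jìngyè" plus U+865F (號).
my $decoded = "J\xECngy\xE8\x{865F}";

# Dropping the last character leaves only code points below 0x100,
# so an unencoded print would emit them as single Latin-1 bytes.
my $chopped = substr($decoded, 0, -1);

# Encode explicitly so the output is UTF-8 bytes in both cases.
my $decoded_bytes = encode('UTF-8', $decoded);
my $chopped_bytes = encode('UTF-8', $chopped);

print "decoded: $decoded_bytes\n";
print "chopped: $chopped_bytes\n";
```

Both strings are now encoded the same way, whether or not they contain a character above 0xFF. Calling binmode(STDOUT, ':encoding(UTF-8)') once instead would apply the encoding to every print on that handle.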
ikegami