4

I have an encoding issue in Perl when trying to pull back international addresses from webpages, using LWP::UserAgent to fetch the pages and Encode for character encoding. I've tried googling for solutions but nothing seems to work. I'm using Strawberry Perl 5.12.3.

As an example, take the address page of the US embassy in the Czech Republic (http://prague.usembassy.gov/contact.html). All I want is to pull back the address:

Address: Tržiště 15 118 01 Praha 1 - Malá Strana Czech Republic

Firefox displays this correctly using the UTF-8 character encoding, which matches the charset in the webpage header. But when I use Perl to pull this back and write it to a file, the encoding looks messed up, despite using decoded_content from LWP::UserAgent or Encode::decode.

I've tried running a regex over the data to check that the error isn't introduced when the data is printed (i.e. that the data is correct inside Perl), but the error seems to be in how Perl handles the encoding.

Here's my code:

#!/usr/bin/perl

require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";

$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for #-digit character in the strings (to guard against the error coming in the print statement) 
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;

And here's the output to terminal:

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

And here is the output to the file (note that it differs slightly from the terminal output but is still not correct). OK, wow: this shows up correctly on Stack Overflow, but not in Bluefish, LibreOffice, Excel, Word or anything else on my computer. So the data is there, just incorrectly encoded. I really don't get what's going on.

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TržištÄ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

Any pointers on how this can be made to work would be really appreciated.

Thanks, Ian/Montecristo

Montecristo

2 Answers

5

The mistake is using a regex to parse HTML. At the very least, you are not decoding the HTML entities. You can do that manually, or leave it to a robust parser:

use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
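
For the "manually" route, a minimal sketch might look like the following. It assumes the HTML::Entities module from CPAN is installed, and the `$fragment` string is a hypothetical stand-in for what the regex captures, not the actual markup of the embassy page:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Write characters to STDOUT as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical fragment (an assumption for illustration): non-ASCII letters
# written as numeric and named HTML entities.
my $fragment = 'Tr&#382;i&scaron;t&#283; 15, Mal&aacute; Strana';

# In non-void context decode_entities returns a decoded copy
# and leaves its argument untouched.
my $address = decode_entities($fragment);

print "$address\n";
print "No ampersand left after decoding\n" if $address !~ /&/;
```

If the captured text still contains an ampersand before this step, that points at undecoded entities rather than at a character-encoding problem.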
daxim
  • OK, great; it looks like I've got some reading up to do! Thanks for the pointer, I'll follow up from here. Out of interest, I was thinking of making the move to Python or Ruby; you can see I'm not a power user of Perl. Would those be able to handle UTF-8 in a more elegant way? – Montecristo Jun 27 '12 at 08:15
  • I don't see what's not elegant about the [IO layer](http://p3rl.org/PerlIO) I used in the code. – Python 2 has [ridiculous defects](https://github.com/thp/python2sucks) surrounding the encoding complex. Python 3 is usable, but still many years after the release, library support is lacking, putting you between Scylla and Charybdis. – From the [Unicode shootout](http://training.perl.com/OSCON2011) you can see that Ruby still has a long way to catch up with Unicode, but at least the encoding support is nice. – daxim Jun 27 '12 at 09:39
  • 6
    @Montecristo, try the move, and you will find that Perl's Unicode support is the most advanced and most powerful. Simply use 5.14. I walked a long path: Perl -> Python -> Ruby -> Perl (wasted time). – kobame Jun 27 '12 at 10:34
  • Thanks @daxim. I'll stick with perl and try to learn more. Really great help. – Montecristo Jun 27 '12 at 12:01
  • 2
    @Montecristo, the only "problem" with Perl's Unicode support is that Perl does it right, and when you do it right there are no shortcuts. Many languages have shortcuts, so on first use they seem easier, but later you find their limits. Unicode is simply a complex thing, and Perl must maintain backward compatibility with 20k+ CPAN modules and so on. Therefore things seem too complicated at the start. Unfortunately, if you want to write correct Unicode programs, you simply need to learn what Unicode is. Read the famous tchrist post: http://stackoverflow.com/a/6163129/632407 – kobame Jun 27 '12 at 12:55
2
#!/usr/bin/env perl

use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open     qw(:std :utf8);

use LWP::Simple;
use HTML::Entities;

my $content = get 'http://prague.usembassy.gov/contact.html';

my ($address) = ($content =~  m{<p><b>Address(.*?)</p>});
decode_entities($address);

say $address;

From the command line:

C:\temp> uu > tt.txt

C:\temp> gvim tt.txt

and the following text is displayed in GVim (which is in UTF-8 mode):

</b>:<br />Tržiště 15<br />118 01 Praha 1 - Malá Strana<br />Czech Republic

See also Tom Christiansen's standard preamble.
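
As a rough illustration of why the UNDECODED CONTENT line in the question comes out garbled, here is a small sketch using only the core Encode module. The byte string is a hand-written stand-in for the raw response body (an assumption, not data fetched from the page); it shows that encoding raw bytes a second time, which is likely what a `:utf8` output layer does to undecoded data, inflates every non-ASCII character:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "Tržiště", written as escapes so this file stays ASCII.
my $bytes = "Tr\xC5\xBEi\xC5\xA1t\xC4\x9B";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding the decoded string gives back the original bytes.
my $once = encode('UTF-8', $chars);

# Encoding the raw bytes instead treats each byte as a Latin-1 character
# and encodes it again, producing double-encoded output.
my $twice = encode('UTF-8', $bytes);
printf "encoded once: %d bytes, encoded twice: %d bytes\n",
    length($once), length($twice);
```

The practical rule this suggests is to decode once on input, work with character strings inside the program, and encode once on output.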

Sinan Ünür