3

Let's say I have this code:

use strict;
use LWP qw ( get );

my $content = get ( "http://www.msn.co.il" );

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?

The website declares its encoding with

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

So why do these characters appear instead of the windows-1255 characters?

Another weird thing is that I have two servers:

The first server returns CP1255 characters, which I can simply convert to UTF-8; the current server gives me these characters and I can't do anything with them.

Is there a configuration file in Apache, Perl, or some module that is messing up the encoding or forcing something?

On my website on the second server, the Perl file and the headers are all UTF-8. When I write text that isn't English, the content from the example above displays fine (even though it's those weird UTF characters), but my own static text looks like "×ס'××ר××:"

One more thing I tested:

Through Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;    

I get UTF-8 encoding.

Through Bash:

curl "http://www.anglo-saxon.co.il"

and here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

# First pass: UTF-8 -> CP1255
my $to_cp1255 = Text::Iconv->new("utf8", "CP1255");
$content = $to_cp1255->convert($content);

# Second pass: CP1255 -> UTF-8
my $to_utf8 = Text::Iconv->new("CP1255", "utf8");
$content = $to_utf8->convert($content);
brian d foy
Ricky Levi

4 Answers

8

All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.

Anyway, since the server does return the correct encoding, this works:

use LWP::UserAgent;

my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;

$content is now a Perl character string, ready for whatever you need. If you want to convert it to some other encoding, calling Encode::encode on it is appropriate; do not use Encode::decode, as it has already been decoded once.
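As a minimal sketch of that encode step (no network involved; the sample string is an assumption standing in for already-decoded content, using the Hebrew word from the question):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for already-decoded content: a Perl character string
# (hypothetical sample data, six Hebrew letters).
my $content = "\x{5dc}\x{5d4}\x{5d3}\x{5e4}\x{5e1}\x{5d4}";

# Turn characters into bytes only at the point of output.
my $utf8_bytes   = encode('UTF-8',  $content);
my $cp1255_bytes = encode('cp1255', $content);

printf "characters:   %d\n", length $content;
printf "UTF-8 bytes:  %d\n", length $utf8_bytes;
printf "CP1255 bytes: %d\n", length $cp1255_bytes;
```

The character count stays the same regardless of the target encoding; only the byte strings differ.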

hobbs
5

http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.

I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.
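To illustrate, here is a small sketch that decodes the byte string from the question two ways. It assumes the garbled static text comes from UTF-8 bytes being read as Windows-1252, which is a guess based on the "×" characters; `code_points` is a hypothetical helper for display only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes quoted in the question.
my $bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Correct interpretation: UTF-8 yields six Hebrew letters.
my $as_utf8 = decode('UTF-8', $bytes);

# Wrong interpretation: every byte becomes its own character,
# and each 0xD7 lead byte shows up as U+00D7 (the "×" sign).
my $as_cp1252 = decode('cp1252', $bytes);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    return join ' ', map { sprintf('U+%04X', ord $_) } split(//, $_[0]);
}

print "as UTF-8:  ", code_points($as_utf8),   "\n";
print "as cp1252: ", code_points($as_cp1252), "\n";
```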

Leon Timmermans
3

First, note that you should import get from LWP::Simple. Second, everything works fine with:

#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';

which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
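A minimal sketch of giving a filehandle an explicit encoding layer. It writes to an in-memory handle so the resulting bytes can be inspected; the sample string is an assumption, and for the real log you would presumably apply the same layer to STDERR instead.

```perl
use strict;
use warnings;

# Hypothetical sample: a character string of six Hebrew letters.
my $text = "\x{5dc}\x{5d4}\x{5d3}\x{5e4}\x{5e1}\x{5d4}";

# Open an in-memory filehandle that encodes characters as UTF-8 on output.
open my $fh, '>:encoding(UTF-8)', \my $buffer
    or die "open failed: $!";
print {$fh} $text;
close $fh;

printf "wrote %d bytes for %d characters\n", length $buffer, length $text;

# The same layer can be applied to an existing handle, e.g.:
#   binmode STDERR, ':encoding(UTF-8)';
```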

Sinan Ünür
2

The string with the hex values that you gave appears to be a UTF-8 encoding. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The LWP::Simple->get() method automatically decodes the content from the server which includes undoing any Content-Encoding as well as converting to UTF-8.

You could dig into the internals and get a version that does change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like

use Encode qw(encode decode);
...;
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
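One way to keep the label and the data in agreement is to derive both from a single charset value. The sketch below assumes a CGI-style response; `render_page` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a header and a body that agree on one charset.
sub render_page {
    my ($charset, $text) = @_;
    my $body   = encode($charset, $text);   # characters -> bytes
    my $header = "Content-Type: text/html; charset=$charset\r\n\r\n";
    return ($header, $body);
}

# Sample character string (six Hebrew letters), an assumption for illustration.
my $text = "\x{5dc}\x{5d4}\x{5d3}\x{5e4}\x{5e1}\x{5d4}";

my ($header, $body) = render_page('UTF-8', $text);

binmode STDOUT;          # emit the already-encoded bytes untouched
print $header, $body;
```

Everything written to the stream, static text included, would go through the same encode call, so nothing in a different encoding can slip in.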

HoldOffHunger
Chris Johnsen
  • It's not exactly the answer, but I took your advice: `use Text::Iconv; my $converter = Text::Iconv->new("utf8", "CP1255"); $content=$converter->convert($content); my $converter = Text::Iconv->new("CP1255", "utf8"); $content=$converter->convert($content);` solved the problem. Yay! – Ricky Levi Feb 26 '10 at 15:01
  • The error "Cannot decode string with wide characters" means that the string is already decoded. Your use of `Text::Iconv`'s conversion from UTF-8->CP1255->UTF-8 only works because Perl's internal encoding is UTF-8. The original `$content` is a character string (according to the error message you got from decode), but you should be passing a byte string to `convert`. You can probably just do `encode('UTF-8',$content)` to get a UTF-8 byte string if that is what you want. – Chris Johnsen Feb 26 '10 at 22:17