I suppose I am using LWP::Simple::get incorrectly, but I am at my wit's end as to how to correct it. My first try was a simple
perl -e 'use LWP::Simple; print get("http://localhost/wtf.txt");'
but that did not work. wtf.txt contains the single UTF-8-encoded character U+00F6 (i.e. ö). Using wget and xxd, I made sure that the HTTP server sends the correct header line Content-Type: text/plain; charset=utf-8 and that the content is as expected. Nevertheless, the Perl one-liner above outputs U+00F6 in its ISO-8859-1 encoding.
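A minimal sketch that reproduces the same output without any HTTP involved. It rests on an assumption: that get hands back a decoded character string rather than the raw bytes. Encode::decode stands in for that step, and the in-memory filehandle is used only so the printed bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes the server sends for wtf.txt: U+00F6 encoded as UTF-8.
my $bytes = "\xc3\xb6";

# Assumption: get() returns something like this, a decoded character string.
my $chars = decode('UTF-8', $bytes);

# Print to a handle that has no encoding layer, as STDOUT has by default.
my $printed = '';
open(my $out, '>', \$printed) or die "cannot open in-memory handle: $!";
print {$out} $chars;
close($out);

print unpack("H*", $printed), "\n";    # f6
```

If that assumption holds, the single byte f6 is not a re-encoding done by the fetch, but what print does with a character below U+0100 when the handle has no encoding layer.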
I thought this was a simple encoding issue with a simple fix, but digging deeper I found it to be less straightforward than I had hoped. I created a second file, wtf2.txt, containing the single UTF-8-encoded character U+30E4 (i.e. ヤ), and fetched both files with the following Perl code:
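#!/usr/bin/perl
use LWP::Simple;
my $wtf = get("http://localhost/$ARGV[0]");    # fetch the file named on the command line
my $wtf2 = pack("H*", unpack("H*", $wtf));     # round-trip the data through a hex dump
print $wtf;
print "\n";
print $wtf2;
print "\n$wtf\n$wtf2\n";
print(unpack("H*", $wtf) . "\n");              # hex dump of the fetched data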
When fetching wtf.txt, this code prints U+00F6 four times in its ISO-8859-1-encoded form, followed by f6 (its ISO-8859-1 encoding in hex). Up to here, everything is as before. But when fetching wtf2.txt, it prints, in order: U+30E4 in UTF-8, U+00E4 (i.e. ä) in ISO-8859-1, U+30E4 in UTF-8, U+00E4 in UTF-8, and finally e4 (the ISO-8859-1 encoding of U+00E4 in hex).
Given that U+30E4 and U+00E4 have nothing to do with each other apart from the latter being a bitmasked/truncated version of the former, I suspect that LWP::Simple not only re-encodes the data but also truncates it. I'm inclined to file a bug report against LWP::Simple, but I'm still hoping for a simple fix and/or an explanation.
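The truncation itself can be reproduced without LWP. The sketch below assumes that get returns a decoded character string (Encode::decode stands in for it here) and shows how unpack("H*") treats a character above 0xFF.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The bytes the server sends for wtf2.txt: U+30E4 encoded as UTF-8.
my $bytes = "\xe3\x83\xa4";

# Assumption: get() returns something like this, a decoded character string.
my $chars = decode('UTF-8', $bytes);
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);

# unpack "H*" expects bytes; a character above 0xFF is wrapped to its low byte.
my $wrapped = do {
    no warnings;    # silences the warning Perl emits for the out-of-range character
    unpack("H*", $chars);
};
print "hex of the character string: $wrapped\n";    # e4

# Encoding to UTF-8 first restores the bytes that were sent.
my $restored = unpack("H*", encode('UTF-8', $chars));
print "hex after encoding to UTF-8: $restored\n";   # e383a4
```

Under that assumption the e4 would come from the pack/unpack round trip in the test script rather than from the fetch.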
By the way, none of the described issues occur if I replace the second and third lines of the script (use LWP::Simple; and the get call) with $wtf=<>; and simply read the files from stdin instead of fetching them via LWP::Simple::get.
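One plausible reason for the difference: <> without an encoding layer yields raw bytes, which print passes through unchanged, whereas a decoded character string needs an output layer to become UTF-8 again. A minimal sketch of such a layer, assuming get returns decoded characters (Encode::decode stands in for it); the in-memory handle is only there so the bytes can be checked.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for what get() is assumed to return: the single character U+00F6.
my $chars = decode('UTF-8', "\xc3\xb6");

# Print through an explicit UTF-8 layer. For real output the same layer can
# be put on STDOUT with binmode(STDOUT, ':encoding(UTF-8)').
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "cannot open buffer: $!";
print {$out} $chars;
close($out);

print unpack("H*", $buffer), "\n";    # c3b6
```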
I tested this using Perl 5.14.2 and libwww-perl 6.04 on Debian 7.