LWP::Simple::get changes encoding

Question

I suspect I am using LWP::Simple::get incorrectly, but I am at my wit's end as to how to fix it. My first attempt was simply

perl -e 'use LWP::Simple; print get("http://localhost/wtf.txt");'

but that did not work. wtf.txt contains a single UTF-8-encoded character, u+00f6 (i.e. ö). Using wget and xxd, I verified that the HTTP server sends the correct header line Content-Type: text/plain; charset=utf-8 and that the content is as expected. Nevertheless, the perl code above outputs u+00f6 encoded as ISO-8859-1.

I thought this was a simple encoding issue with a simple fix, but digging deeper I found it to be less straightforward than I had hoped. I created a second file, wtf2.txt, containing the single UTF-8-encoded character u+30e4 (i.e. ヤ), and fetched both files with the following perl code:

#!/usr/bin/perl
use LWP::Simple;
$wtf=get("http://localhost/$ARGV[0]");
$wtf2=pack("H*",unpack("H*",$wtf));
print $wtf;
print "\n";
print $wtf2;
print "\n$wtf\n$wtf2\n";
print (unpack("H*",$wtf)."\n");

When fetching wtf.txt, this code writes u+00f6 four times in its ISO-8859-1-encoded form, followed by f6 (its ISO-8859-1 encoding in hex). Up to here, everything is as before. But when fetching wtf2.txt, this code writes, in order: u+30e4 in UTF-8, u+00e4 (i.e. ä) in ISO-8859-1, u+30e4 in UTF-8, u+00e4 in UTF-8, and e4 (the ISO-8859-1 encoding of u+00e4 in hex).

Given that u+30e4 and u+00e4 have nothing to do with each other apart from the latter being a bitmasked/truncated version of the former, I suspect that not only re-encoding but also some truncation happens inside LWP::Simple. I'm inclined to file a bug report against LWP::Simple, but I'm still hoping for a simple fix and/or an explanation.

By the way, none of the described issues occur if I replace the second and third lines with $wtf=<>; and simply read the files from stdin instead of fetching them via LWP::Simple::get.

I tested this using perl 5.14.2 and libwww 6.04 on Debian 7.

You can also see [http://stackoverflow.com/q/2341128/2766176](http://stackoverflow.com/q/2341128/2766176). — brian d foy, Nov 05 '16 at 21:40

score 1 · Accepted Answer · answered Nov 05 '16 at 23:25

This is a bug in your code.

LWP::Simple::get doesn't return the original bytes (in some encoding), it returns decoded text (i.e. Unicode). (Which makes sense, because if it returned bytes, you wouldn't know how to decode them because get doesn't tell you the encoding.)

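So get("http://localhost/wtf.txt") returns a string containing the codepoint U+00f6. print then writes some bytes to STDOUT. What are those bytes? That depends on the encoding layer currently set on the filehandle. With no layer set, the default is a weird mix of Latin-1 and UTF-8 that depends on the contents of the string: if every codepoint is <= 255, print writes one byte per character (effectively Latin-1); if the string contains any codepoint above 255, the whole string is written as UTF-8 and perl emits a "Wide character in print" warning.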

If you want to get UTF-8 output, do binmode STDOUT, ":encoding(UTF-8)"; first. That ensures all text written to STDOUT is encoded as UTF-8.
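A minimal sketch of the binmode approach, using two literal strings as stand-ins for the decoded text that get would return (an assumption, so that the example runs without a web server):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the decoded text that LWP::Simple::get would return
# (assumption: a real script would call get() here instead).
my $o_umlaut = "\x{00f6}";   # ö
my $katakana = "\x{30e4}";   # ヤ

# Encode everything printed to STDOUT as UTF-8 from here on.
binmode STDOUT, ":encoding(UTF-8)";

print "$o_umlaut\n";   # bytes written: c3 b6 0a
print "$katakana\n";   # bytes written: e3 83 a4 0a
```

Piping the output through xxd should show the UTF-8 byte sequences noted in the comments, regardless of which codepoints the string contains.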

On the other hand, if you want to ignore encodings and just write the bytes that you received from the web server, then LWP::Simple is the wrong choice. Use LWP::UserAgent instead and call $response->content. (LWP::Simple::get uses $response->decoded_content internally.)
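A rough sketch of the LWP::UserAgent route. The helper name fetch_both is hypothetical (it is not part of LWP); it returns both the raw bytes and the decoded text so the two can be compared:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a URL and return two views of the response body:
# the raw bytes as sent by the server, and the decoded text.
# (fetch_both is a hypothetical helper name, not an LWP function.)
sub fetch_both {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return ($response->content, $response->decoded_content);
}
```

With the server from the question, `my ($bytes, $text) = fetch_both("http://localhost/wtf.txt");` followed by `print $bytes;` should write the file's bytes unchanged, while `$text` is the decoded string that still needs an output encoding when printed.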

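The truncation in your second example is probably due to pack/unpack, which don't make sense on Unicode strings (they're meant for byte strings, i.e. strings in which every codepoint is <= 255). Given a codepoint above 255, unpack("H*", ...) keeps only its low 8 bits, which would explain how U+30e4 turned into e4.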
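One way to get a meaningful hex dump is to encode the text to bytes explicitly before handing it to unpack. The sketch below uses the core Encode module; hex_of_utf8 is a hypothetical helper name chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the hex form of a text string's UTF-8 encoding.
# Encoding first guarantees that unpack only ever sees bytes (0..255).
# (hex_of_utf8 is a hypothetical helper, not a library function.)
sub hex_of_utf8 {
    my ($text) = @_;
    return unpack "H*", encode("UTF-8", $text);
}

print hex_of_utf8("\x{00f6}"), "\n";   # c3b6
print hex_of_utf8("\x{30e4}"), "\n";   # e383a4
```

Choosing the encoding explicitly makes the dump show the bytes that would go over the wire in that encoding, rather than a truncated view of the codepoints.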

Thank you. Both `binmode STDOUT` and `LWP::UserAgent` work. pack/unpack were the recommended way to view a hex version of perl data. Is there a better alternative, something that gives me an unaltered hex/octal/decimal view of what perl stores in its variables? If I had had that, I could've debugged it myself and wouldn't have had to bother stackoverflow with it. — user2845840, Nov 08 '16 at 12:38
@user2845840 If you want to see what perl thinks it has in its strings, use `printf "%vd\n", $str` (decimal) or `printf "%vx\n", $str` (hex). The output will be in "dotted decimal" (or hex) form, with one number per codepoint (this also tells you what perl thinks the length of the string is (number of dots + 1)). — melpomene, Nov 08 '16 at 22:40
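
A short sketch of the `%vd` / `%vx` suggestion, using a literal two-character string as a stand-in for fetched text (an assumption for illustration):

```perl
use strict;
use warnings;

# Decoded text: ö (U+00F6) followed by ヤ (U+30E4).
my $text = "\x{00f6}\x{30e4}";

# The "v" flag prints the ordinal of each character, joined by dots.
printf "%vd\n", $text;   # 246.12516
printf "%vx\n", $text;   # f6.30e4
```

Two numbers separated by one dot means perl sees a string of length 2, i.e. two codepoints rather than five UTF-8 bytes.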
