8

I have a problem with Perl and the `encoding` pragma.

(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I never want to use any other encoding.)

However, when I write

binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;

I see the string "í" (which is what I want, and which is the Unicode character U+00ED). But when I add the `use encoding` pragma like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;

all I see is a box character. Why?

Moreover, when I add Data::Dumper and let the Dumper print the new string like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);

I see that perl changed the string to "\x{fffd}". Why?

Karel Bílek

2 Answers

10

`use encoding 'utf8'` is broken. Rather than interpreting `\x{ed}` as the code point U+00ED, it interprets it as the single byte 237 and then tries to decode that byte as UTF-8. A lone 0xED byte is not a valid UTF-8 sequence, so the decode fails and the byte is replaced with the replacement character U+FFFD, literally "�".
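To make the byte-versus-code-point distinction concrete, here is a minimal sketch that mimics the decode step with `Encode::decode`. It is only an illustration of the failure mode, not the pragma's actual implementation, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A one-byte string holding byte value 237 (0xED), i.e. not yet decoded text.
my $lone_byte = "\xED";

# 0xED would start a three-byte UTF-8 sequence, but nothing follows it,
# so with the default CHECK value Encode substitutes U+FFFD.
my $decoded_bad = decode('UTF-8', $lone_byte);

# The real UTF-8 encoding of U+00ED is the two bytes C3 AD.
my $decoded_ok = decode('UTF-8', "\xC3\xAD");

printf "lone ED -> U+%04X\n", ord($decoded_bad);
printf "C3 AD   -> U+%04X\n", ord($decoded_ok);
```

The first line should report the replacement character and the second the intended í, which matches what Data::Dumper showed in the question.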

Just stick with `use utf8` to specify that your source is in UTF-8, and `binmode` or the `open` pragma to specify the encoding for your file handles.

Anomie
  • Oh... OK. I can't claim I understand the reason for the reinterpreting, but there are far, far more weird things in perl. Thanks – Karel Bílek Mar 19 '11 at 16:16
  • As far as I can tell, the reason is that `use encoding` was designed so people could write `use encoding 'euc-jp'; $r = "\xF1\xD1\xF1\xCC";` and have it interpreted "correctly". But that would mean you'd have to write your UTF-8 string in the same style, as `$r = "\xC3\xAD";`. That gets confusing when combined with Perl's native support for Unicode escapes like `$r = "\x{200b}";`: escapes with codes 0x80-0xff are interpreted differently from escapes with codes 0x100 and up. – Anomie Mar 19 '11 at 16:20
  • Yeah, Perl's support for 8-bit locales (`use encoding`, `use locale`) should be kept at the other end of a very long stick. – hobbs Mar 19 '11 at 17:08
5

Your actual code needs neither `use encoding` nor `use utf8` to run properly -- the only thing it depends on is the encoding layer on STDOUT.

binmode(STDOUT, ":utf8");
print "\xed";

is an equally valid complete program that does what you want.

`use utf8` should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written

my $r = "í";

then `use utf8` would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
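A quick way to see that difference is to compare `length` on the same literal with and without the pragma. This is a small sketch that assumes the script file is saved as UTF-8 and that the í is the precomposed character; `$chars` and `$bytes` are just illustrative names.

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;               # the literal below is decoded as UTF-8 source text
    $chars = length("í");   # one character, U+00ED
}
{
    no utf8;                # the literal below is left as the raw source bytes
    $bytes = length("í");   # two bytes, C3 AD
}
print "with use utf8: $chars, without: $bytes\n";
```

Since `use utf8` is lexically scoped, the two blocks can sit side by side in one file without affecting each other.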

`use encoding` should never be used, especially by someone who likes Unicode. If you want the encoding of stdin/stdout to be changed, you should use `-C` or `PERLUNICODE`, or `binmode` them yourself, and if you want other handles to be automatically opened with encoding layers, you should `use open`.
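As a rough sketch of the `use open` route: the pragma below sets a default UTF-8 layer for the standard handles and for handles opened in its lexical scope without an explicit layer. The temp file is only there to make the effect visible, and `$path` and `$size` are hypothetical names for the example.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # default layer for STDIN/STDOUT/STDERR and for open()
use File::Temp qw(tempfile);

# Create a scratch file; File::Temp's own handle is not needed afterwards.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer given in the mode, so the pragma's :encoding(UTF-8) default applies.
open(my $out, '>', $path) or die "open $path: $!";
print {$out} "\x{ed}";                # one character, U+00ED
close $out;

# U+00ED encoded as UTF-8 occupies two bytes (C3 AD) on disk.
my $size = -s $path;
print "bytes on disk: $size\n";
```

For the standard handles alone, running the script as `perl -CS script.pl` or calling `binmode` on each handle is the equivalent of the `:std` part.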

hobbs