Perl's use encoding pragma breaking UTF strings

Question

I have a problem with Perl's encoding pragma.

(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I never want to use any other encoding.)

However, when I write

binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;

I see the string "í" (which is what I want, and which is the Unicode character U+00ED). But when I add the "use encoding" pragma like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;

all I see is a box character. Why?

Moreover, when I add Data::Dumper and let the Dumper print the new string like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);

I see that perl changed the string to "\x{fffd}". Why?

See also: http://stackoverflow.com/questions/492838/why-do-my-perl-tests-fail-with-use-encoding-utf8 — Eugene Yarmash, Mar 19 '11 at 16:09

score 10 · Accepted Answer · answered Mar 19 '11 at 16:07

use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 and then tries to decode that byte as UTF-8. That fails, because a lone 0xED byte is not a complete UTF-8 sequence, so it winds up substituting the replacement character U+FFFD, literally "�".
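To make the byte-level reinterpretation concrete, here is a minimal sketch that reproduces the same effect by hand with the core Encode module. It does not load use encoding at all; it only assumes that the pragma's treatment of the escape is equivalent to decoding the byte string as UTF-8, as described above. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The lone byte 0xED, i.e. what "\x{ed}" is treated as under the
# behaviour described above (an assumption of this sketch).
my $lone_byte = "\xED";

# Lenient decoding (the default CHECK) substitutes a replacement
# character for malformed input instead of dying.
my $from_lone_byte = decode('UTF-8', $lone_byte);

# For comparison: the two bytes that really are the UTF-8 encoding of U+00ED.
my $utf8_bytes      = "\xC3\xAD";
my $from_utf8_bytes = decode('UTF-8', $utf8_bytes);

printf "ED    decodes to U+%04X\n", ord $from_lone_byte;
printf "C3 AD decodes to U+%04X\n", ord $from_utf8_bytes;
```

With a stock Encode, the first line should report U+FFFD and the second U+00ED, which matches the "\x{fffd}" that Data::Dumper showed in the question.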

Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.

answered Mar 19 '11 at 16:07

Anomie

Oh... OK. I can't claim I understand the reason for the reinterpreting, but there are far, far more weird things in perl. Thanks – Karel Bílek Mar 19 '11 at 16:16

As far as I can tell, the reason is that `use encoding` was designed so people could write `use encoding 'euc-jp'; $r = "\xF1\xD1\xF1\xCC";` and have it interpreted "correctly". But that would mean you'd have to write your UTF-8 string in the same style, as `$r = "\xC3\xAD";`. That then gets confusing when combined with Perl's native support for Unicode escapes like `$r = "\x{200b}";`: escapes with codes 0x80-0xff are interpreted differently from escapes with codes 0x100 and up. – Anomie Mar 19 '11 at 16:20

Yeah, Perl's support for 8-bit locales (`use encoding`, `use locale`) should be kept at the other end of a very long stick. – hobbs Mar 19 '11 at 17:08

score 5 · Answer 2 · answered Mar 19 '11 at 16:20

Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.

binmode(STDOUT, ":utf8");
print "\xed";

is an equally valid complete program that does what you want.

use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written

my $r = "í";

then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
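The bytes-versus-character distinction can be seen without relying on how the script file itself is saved. The sketch below writes the UTF-8 encoding of í as explicit bytes and decodes a copy with the built-in utf8::decode function; this is only an approximation of what use utf8 does to a literal at compile time, and the variable names are illustrative.

```perl
use strict;
use warnings;

# The UTF-8 encoding of "í", spelled out as two bytes. This is what the
# literal amounts to when the source is UTF-8 and `use utf8` is absent.
my $bytes = "\xC3\xAD";

# Decode a copy in place. utf8::decode is built in and does not
# require `use utf8` to be loaded.
my $chars = $bytes;
utf8::decode($chars);

printf "as bytes: length %d\n", length $bytes;
printf "decoded:  length %d, code point U+%04X\n", length $chars, ord $chars;
```

The byte string has length 2, while the decoded copy has length 1 and holds the single code point U+00ED. That difference is what matters for things like regular expressions containing literal non-ASCII characters.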

use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of stdin/stdout to be changed, you should use -C or PERLUNICODE, or binmode them yourself; and if you want other handles to be automatically opened with encoding layers, you should use open.
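As a rough sketch of the last suggestion, the open pragma can set a default encoding layer for handles opened in its lexical scope. The example writes to a temporary file (the use of File::Temp and all variable names are choices made for this illustration, not part of the answer) and then reads the file back with an explicit :raw layer to inspect the bytes that were actually written.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layer for opens in this scope that do not name a layer themselves.
use open qw(:encoding(UTF-8));

# Scratch file for the demonstration; removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer is given here, so the default from `use open` applies.
open my $out, '>', $path or die "cannot write $path: $!";
print {$out} "\x{ed}";
close $out or die "cannot close $path: $!";

# An explicit :raw layer takes the place of the default, so the
# encoded bytes are read back unchanged.
open my $in, '<:raw', $path or die "cannot read $path: $!";
my $written = do { local $/; <$in> };
close $in;

printf "bytes on disk: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $written;
```

To cover the standard streams as well, the usual spelling is use open qw(:std :encoding(UTF-8)); running the script with perl -CS or setting PERLUNICODE=S has a comparable effect on STDIN, STDOUT and STDERR.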

hobbs: yes, I have actual UTF-8 literals in my code (in regular expressions). Thanks. — Karel Bílek, Mar 19 '11 at 16:41
