What string should be used to specify encoding in Perl POD, "utf8", "UTF-8" or "utf-8"?

Question

It is possible to write Perl documentation in UTF-8. To do it you should write in your POD:

=encoding NNN

But what should you write instead of NNN? Different sources give different answers.

perlpod says that it should be =encoding utf8
this stackoverflow answer states that it should be =encoding UTF-8
and this answer tells me to write =encoding utf-8

What is the correct answer? What is the correct string to be written in POD?

Technically, none of those. Unicode and UTF-8 are different encodings. — cdhowie, Aug 07 '13 at 16:50
To be even more pedantic, unicode is a decoding, not an encoding. — mob, Aug 07 '13 at 16:55
Thank you =) You are right. I'll remove the term Unicode from the question. — bessarabov, Aug 07 '13 at 16:56

score 16 · Accepted Answer · answered Aug 07 '13 at 17:01

=encoding UTF-8

According to IANA, charset names are case-insensitive, so utf-8 is the same.

utf8 is Perl's lax variant of UTF-8. However, for safety, you want to be strict with your POD processors.
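
The lax/strict difference can be seen directly with Perl's Encode module. The following is a minimal sketch, not part of the original answer: the byte string (the encoding of 0x110000, one past the last Unicode code point) and the `try_decode` helper are illustrative choices.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Report the canonical Encode name behind each label from the question.
for my $label (qw(UTF-8 utf-8 utf8)) {
    printf "%-5s resolves to %s\n", $label, find_encoding($label)->name;
}

# Bytes for code point 0x110000, one past the last Unicode code point
# (U+10FFFF). Strict UTF-8 should reject them; lax utf8 should accept them.
my $bytes = "\xF4\x90\x80\x80";

# Hypothetical helper: describe what decoding $bytes under $label does.
sub try_decode {
    my ($label) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode($label, $copy, Encode::FB_CROAK) };
    return defined $text
        ? sprintf('accepted as code point 0x%X', ord $text)
        : 'rejected';
}

printf "%-5s %s\n", $_, try_decode($_) for qw(UTF-8 utf8);
```

If Encode behaves as documented, both hyphenated labels should resolve to the same strict encoding, and only the hyphen-less `utf8` should accept the out-of-range bytes.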

daxim

Thank you. This is the answer I wanted to get =) One more thing. So [perlpod](http://perldoc.perl.org/perlpod.html) with its `=encoding utf8` is incorrect. Do you think it is worth proposing a patch? – bessarabov Aug 07 '13 at 17:06
It's not a big thing. Do what you want. – daxim Aug 07 '13 at 17:08

mob · Answer 2 · 2013-08-07T17:43:12.567

As daxim points out, I have been misled. `=encoding UTF-8` and `=encoding utf-8` apply the strict encoding, and `=encoding utf8` applies the lenient encoding:

    $ cat enc-test.pod
    =encoding ENCNAME

    =head1 TEST '\344\273\245\376\202\200\200\200\200\200'

    =cut

(Here \xxx means the literal byte with octal value xxx. \344\273\245 is a valid UTF-8 sequence; \376\202\200\200\200\200\200 is not.)
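
Because the listing shows the bytes as escapes, the file cannot simply be copied and pasted. One possible way to create it is sketched below. The script is not part of the original test; it assumes that octal escapes in a double-quoted Perl string are an acceptable way to produce the literal bytes.

```perl
use strict;
use warnings;

# File name taken from the test above.
my $file = 'enc-test.pod';

# Octal escapes in a double-quoted string become the literal bytes.
my $pod = "=encoding ENCNAME\n\n"
        . "=head1 TEST '\344\273\245\376\202\200\200\200\200\200'\n\n"
        . "=cut\n";

# Write in raw mode so that no I/O layer re-encodes the bytes.
open my $fh, '>:raw', $file or die "Cannot write $file: $!";
print {$fh} $pod;
close $fh or die "Cannot close $file: $!";
```

The ENCNAME placeholder is left in place so that the `perl -pe 's/ENCNAME/.../'` commands in this answer can substitute it.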

`=encoding utf-8`:

    $ perl -pe 's/ENCNAME/utf-8/' enc-test.pod | pod2cpanhtml | grep /h1
    >TEST &#39;&#20197;&#27492;&#65533;&#39;</a></h1>

`=encoding utf8`:

    $ perl -pe 's/ENCNAME/utf8/' enc-test.pod | pod2cpanhtml | grep /h1
    Code point 0x80000000 is not Unicode, no properties match it; ...
    Code point 0x80000000 is not Unicode, no properties match it; ...
    Code point 0x80000000 is not Unicode, no properties match it; ...
    >TEST &#39;&#20197;&#2147483648;&#39;</a></h1>

My original answer, which the test above contradicts, follows. They are all equivalent. The argument to =encoding is expected to be a name recognized by the Encode::Supported module. When you drill down into that document, you see:

the canonical encoding name is utf8
the name UTF-8 is an alias for utf8, and
names are case insensitive, so utf-8 is equivalent to UTF-8

What's the best practice? I'm not sure. I don't think you can go wrong using the official IANA name (as per daxim's answer), but you can't go wrong following the official Perl documentation, either.

That alias part of the documentation misled you: hyphen and no hyphen are treated differently. Try: `perl -MEncode=decode -MDevel::Peek=Dump -e'Dump decode "utf-8", "\xfe\x82\x80\x80\x80\x80\x80", Encode::FB_CROAK | Encode::LEAVE_SRC'` — daxim, Aug 07 '13 at 17:19
Wow! Thank you for doing such great work for showing the difference between utf8 and utf-8. — bessarabov, Aug 08 '13 at 04:26
