How does Perl's length() function count Unicode characters?

Question

Why does length() say this is 4 logical characters (I would expect it to say 1)?

$ perl -lwe 'print length("🐪")'
4

I guess something is wrong with my expectation. :-) What is it?

[Everything you wanted to know about Unicode handling in Perl but were afraid to ask](https://stackoverflow.com/a/6163129/226648) — el.pescado - нет войне, Nov 05 '18 at 08:22

JGNI · Accepted Answer · 2018-11-05T12:21:36.520

11

Unless you tell Perl that the source code of the script is in UTF-8, Perl assumes ASCII. This means that by default the Perl interpreter sees 🐪 as 4 separate characters (the four bytes of its UTF-8 encoding). If you change your one-liner to `perl -Mutf8 -lwe 'print length("🐪")'`, you will see length give your expected output of 1.

The utf8 pragma tells Perl that the source is encoded in UTF-8 rather than ASCII. See `perldoc utf8` for more info.

edited Nov 05 '18 at 12:21

answered Nov 05 '18 at 08:38


Can you share where the documentation says that Perl by default assumes latin1? – jreisinger Nov 05 '18 at 11:46
@jreisinger There is a comment in the documentation for the `encoding` pragma in the section `Implicit upgrading for byte strings`. There may be better documentation elsewhere. – JGNI Nov 05 '18 at 12:03

@jreisinger, It does not assume latin-1. It assumes US-ASCII, leaving non-ASCII bytes unchanged. Since you provided the bytes `F0.9F.90.AA`, Perl created a string equivalent to the one created by `"\xF0\x9F\x90\xAA"`. With `use utf8;` (which is what `-Mutf8` adds), Perl decodes the source as UTF-8, so Perl creates the string equivalent to the one created by `"\x{1F42A}"`. – ikegami Nov 05 '18 at 12:06
Demonstration that it's not latin1: `perl -MEncode -e'print encode("UTF-8", "sub f\xC9 { }")' | perl -Mutf8` works, but `perl -MEncode -e'print encode("latin1", "sub f\xC9 { }")' | perl` does not. – ikegami Nov 05 '18 at 12:15
@JGNI, Those words don't appear in the [documentation for the `encoding` pragma](https://metacpan.org/pod/encoding), or at least not in the 14 versions of it found on CPAN. – ikegami Nov 05 '18 at 12:15
@ikegami I now agree with you about ASCII over Latin-1, as `perl -E 'say uc "é"'` gives `é`. What follows is from the `encoding` pragma documentation for v5.20.2, under `Implicit upgrading for byte strings`: `By default, if strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will be created by decoding the byte strings as ISO 8859-1 (Latin-1).` – JGNI Nov 05 '18 at 12:24
That's a lie. No decoding is performed. The two strings are simply concatenated as-is with no semantics presumed. The encoding pragma was deprecated (and had been discouraged for far longer than that) because it had some pretty wild ideas about how things work or should work. – ikegami Nov 05 '18 at 12:50
Re "*Now agree with you about ASCII over Latin-1 as `perl -E 'say uc "é"'` gives `é`*": that code doesn't demonstrate that at all. Even if Perl were to decode the source using latin-1, you'd still get the same string, and you'd still get the same output. – ikegami Nov 05 '18 at 13:00
