
I have a problem similar to "How can I properly align UTF-8 strings with Perl's printf?":

My (Linux) system's default locale is LC_CTYPE=de_DE.UTF-8, and I wrote a Perl program (using perl-5.26.1) that is "not" using Unicode characters, only some from the ISO Latin-1 character set (° for example). Therefore I did not activate any Unicode or locale features in my Perl script.

"Everything" seems to work fine with one exception: I'm using a printf format of %-10s to align strings, but that does not work as expected.

Playing in the debugger I found this behavior:

  DB<1> $s='X°X'

  DB<2> printf("_%3s_\n", $s)
_X°X_

Looks OK so far...

  DB<3> printf("_%4s_\n", $s)
_X°X_

Oops; shouldn't that be "_ X°X_"?

  DB<4> printf("_%5s_\n", $s)
_ X°X_

Off by one?

  DB<5> x length($s)
0  4

Shouldn't that be 3?

  DB<8> x ord($s[1])
0  0
  DB<9> x $s
0  'X°X'
  DB<10>

Shouldn't ° be encoded as one byte? I thought UTF-8 maps the Latin-1 range unmodified to Unicode.

So my questions are:

  1. What's going on?

  2. Is it a Perl bug?

  3. If not, how can I fix the formatting and string length?

U. Windl
  • @Keith Thompson: I read that before but actually https://en.wikipedia.org/wiki/UTF-8#Codepage_layout suggests that it is not true: *Parts* of Latin 1 are literally mapped into UTF-8. Specifically "U+00B0 ° Degree symbol". – U. Windl Jul 12 '19 at 22:14
  • The Wikipedia page states that $B0 is in the range of continuation bytes of the UTF-8 encoding scheme, from which it follows that the Unicode code point U+00B0 is not mapped 1:1. The UTF-8 specification shows that if bit 7 of the first octet of a UTF-8 sequence is set, that byte contains bits indicating the length of the sequence, so no 1:1 mappings are possible for any code point in the range U+0080 to U+00FF. – collapsar Jul 12 '19 at 22:31
  • @collapsar: OK, excuse my misunderstanding of UTF-8 encoding. – U. Windl Jul 12 '19 at 22:34
  • So the Unicode character code for `°` still is `$B0`, but the actual encoding is `$C2 $B0`? – U. Windl Jul 12 '19 at 22:56
  • @U.Windl: Unicode uses the same numeric values for characters 0..255. UTF-8 encodes each character in the range 0..127 (ASCII) as a single byte, and each character in the range 128..255 (outside ASCII, within Latin-1) as two bytes. Characters up to 2047 are also encoded as two bytes. Remember, UTF-8 is not Unicode; it's one of several encodings of Unicode. – Keith Thompson Jul 12 '19 at 23:21
  • @U.Windl: I should have mentioned, the numeric values of the characters are called *code points*. So 0xb0 (176) is the code point for the DEGREE SIGN character, and UTF-8 encodes it as a byte with value 0xc2 followed by a byte with value 0xb0. – Keith Thompson Jul 12 '19 at 23:24

2 Answers


UTF-8 only maps the ASCII range (0..127) to 1 byte. Latin-1 characters are in the range 0..255; UTF-8 can't map them all to one byte. If it did, there would be no mappings left for anything else.

Characters from 0 to 127 are encoded in 1 byte.
Characters from 128 to 2047 are encoded in 2 bytes.
And so on.

https://en.wikipedia.org/wiki/UTF-8
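
The byte ranges above can be checked with a short sketch. It assumes only the core Encode module; the `%bytes_for` hash and the sample code points are chosen here for illustration and are not part of the original answer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points (illustrative): 'x' (ASCII), DEGREE SIGN (Latin-1),
# U+07FF (last code point in the two-byte range) and U+0800.
my %bytes_for;
for my $cp (0x78, 0xB0, 0x7FF, 0x800) {
    my $char = chr($cp);                               # one character
    $bytes_for{$cp} = length(encode('UTF-8', $char));  # its UTF-8 byte count
    printf "U+%04X -> %d byte(s)\n", $cp, $bytes_for{$cp};
}
```

Only the first code point stays within the single-byte (ASCII) range; the degree sign at 0xB0 already needs two bytes.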

You need both `use utf8;` and `binmode STDOUT, ':encoding(UTF-8)';` in your Perl script (I did the same with STDIN and STDERR just for consistency):

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

BEGIN {
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    binmode STDERR, ':encoding(UTF-8)';
}

printf "|%-10s|\n", "x";
printf "|%-10s|\n", "°";

The output is correctly aligned:

|x         |
|°         |

If I comment out either `use utf8;` or `binmode STDOUT, ':encoding(UTF-8)';`, the output is misaligned and/or the degree character isn't displayed correctly.

Quoting perldoc utf8 (the documentation for the utf8 module):

The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope.

(This requires an output device or terminal emulator configured to display UTF-8.)
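
To see why the field width in the question looks off by one, here is a minimal sketch. It assumes the source file is UTF-8 and that `use utf8` is absent, so the degree sign reaches Perl as the two bytes 0xC2 0xB0; the bytes are written as escapes so the sketch does not depend on how the file is saved. All variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What Perl sees for 'X°X' in a UTF-8 source file without "use utf8":
# four bytes, because the degree sign is 0xC2 0xB0.
my $bytes = "X\xC2\xB0X";

# Decoding yields three characters: X, U+00B0, X.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length($bytes);   # length of the undecoded byte string
my $char_len = length($chars);   # length of the decoded character string

# sprintf pads based on length(), so the byte string receives one
# space fewer than the decoded string for the same field width.
my $padded_bytes = sprintf('%5s', $bytes);
my $padded_chars = sprintf('%5s', $chars);
my ($lead_bytes) = $padded_bytes =~ /^( *)/;
my ($lead_chars) = $padded_chars =~ /^( *)/;

printf "length: %d bytes vs %d characters\n", $byte_len, $char_len;
printf "%%5s padding: %d vs %d space(s)\n",
    length($lead_bytes), length($lead_chars);
```

This matches the debugger session in the question: `length` reports 4, and `%5s` adds a single space because Perl is counting bytes, not characters.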

Keith Thompson
  • Odd things happen when I `use utf8`: In the source `°` is encoded as `\xb0` (according to Emacs), the character output also is `\xb0`, but with `binmode STDOUT, 'utf8'` the output for `°` is `\xc2\xb0`. However when I load the output into Emacs, it claims that the character encoding is `Char: ° (176, #o260, #xb0, file ...)...`. – U. Windl Jul 12 '19 at 22:31
  • @U.Windl: Then your source file is encoded as Latin-1 or something similar. The UTF-8 encoding of the DEGREE SIGN character (code point 0xb0) is (0xc2, 0xb0). It's probably best to use UTF-8 consistently for your source file(s). But if you have a lot of Latin-1-encoded source files, translating them might be non-trivial. (See the `iconv` command if you want to do this.) If you want to keep your source files as Latin-1, that's likely to be more complicated. – Keith Thompson Jul 12 '19 at 23:06
  • @ikegami: So `man perluniintro` is wrong when it still recommends `binmode(STDOUT, ":utf8")`? – U. Windl Jul 12 '19 at 23:06
  • @ikegami: Edited -- is this better? – Keith Thompson Jul 12 '19 at 23:08
  • @U.Windl, Judge for yourself: [`:encoding(UTF-8)` vs `:encoding(utf8)` vs `:utf8`](https://stackoverflow.com/a/49040165/589924) – ikegami Jul 12 '19 at 23:13
  • @U.Windl: `perldoc binmode` says: "`:utf8`" just marks the data as UTF-8 without further checking, while "`:encoding(UTF-8)`" checks the data for actually being valid UTF-8. More details can be found in `PerlIO::encoding`. – Keith Thompson Jul 12 '19 at 23:14
  • Note that in Unicode a single character can have 0, 1, 2 or ambiguous width when printed to a terminal (or similar column-oriented output format). Perl's `length` and `sprintf` will not give the correct result in all cases. Use [`columns` in Unicode::GCString](http://p3rl.org/Unicode::GCString#columns) instead. – daxim Jul 13 '19 at 04:58

Perl code must be encoded using ASCII (`no utf8;`, the default) or UTF-8 (`use utf8;`).

° is not in the ASCII character set, and you apparently didn't `use utf8;` either, so your program couldn't possibly contain ° as you claim.

First, encode the program using UTF-8 (if it's not already) and tell Perl that your program is encoded using UTF-8 by adding

use utf8;   # The source code is encoded using UTF-8.

Secondly, you apparently didn't tell Perl to encode what you printed either. Fix that by adding

use open ':std', ':encoding(UTF-8)';   # The terminal provides/expects UTF-8.

The latter also sets the default encoding for files opened within the scope of the pragma. If you want to avoid this, you can use the following instead:

BEGIN {   # The terminal provides/expects UTF-8.
   binmode(STDIN,  ':encoding(UTF-8)');
   binmode(STDOUT, ':encoding(UTF-8)');
   binmode(STDERR, ':encoding(UTF-8)');
}
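
Putting both fixes together, a minimal sketch might look like the following. It assumes the file itself is saved as UTF-8 and that the terminal expects UTF-8; the `@labels` and `@rows` names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # The source code is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';   # The terminal provides/expects UTF-8.

# Sample strings (illustrative), including the one from the question.
my @labels = ('x', '°', 'X°X');

# With "use utf8", each label is a string of characters, so the
# %-5s field pads every row to the same number of characters.
my @rows = map { sprintf '|%-5s|', $_ } @labels;

print "$_\n" for @rows;
```

With both pragmas in place, every row should have the same character length, so the closing bars line up on a UTF-8 terminal.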
ikegami
  • I don't understand the statement on ASCII: I had been writing Perl code in ISO Latin1 character sets, and even processed such input and output data without a problem in the past. So your claim "either US-ASCII or UTF-8" seems incorrect unless there was some recent change in Perl regarding this. – U. Windl Jul 12 '19 at 22:38
  • Perl doesn't know and has never known anything about iso-latin-1. I can only surmise that a combination of bugs mostly cancelled each other out in your past experience ([as explained here](https://www.perlmonks.org/?node_id=11101840)), but you've finally reached a situation where that wasn't the case. – ikegami Jul 12 '19 at 22:51
  • Maybe for my level of processing it was sufficient for Latin1 processing that perl strings were "eight-bit-clean" (just write what you got unmodified). – U. Windl Jul 12 '19 at 23:09
  • Is there a simplification when assuming that source code, input data and output data all use the same encoding, namely that of `$LANG` or `$LC_CTYPE`? – U. Windl Jul 12 '19 at 23:35
  • I think you're asking for `use open ':std', ':encoding(locale)';`? Source still needs to be ASCII (`no utf8;`) or UTF-8 (`use utf8;`). Also, [see this](https://www.perlmonks.org/?node_id=1231890) – ikegami Jul 12 '19 at 23:43
  • Outside of `use utf8` it is treated as the native single-byte encoding, not ASCII. You of course can only (mostly) rely on the native single-byte encoding being a superset of ASCII, but ISO-8859-1 happens to have the same layout as those unicode codepoints, so even for example encoding such a string to UTF-8 happens to work as expected. – Grinnz Jul 17 '19 at 18:43
  • There was [a proposal](https://www.nntp.perl.org/group/perl.perl5.porters/2017/10/msg246838.html) to deprecate use of non-ASCII characters without `use utf8` which would clarify a lot of this, but nothing really came of it so far. – Grinnz Jul 17 '19 at 18:44
  • @Grinnz, Re "*Outside of use utf8 it is treated as the native single-byte encoding, not ASCII.*", That is not true. Try `sub fête { }` without `use utf8;` Try `uc("ů")` without `use utf8;`. Try `encode("UTF-8", "ů")` without `use utf8;`. None of these work. As I said, Perl expects ASCII. It simply doesn't throw an error if you use non-ASCII in string literals, making it 8-bit clean. Visit the link in the second comment (my first comment). – ikegami Jul 17 '19 at 20:53
  • @Grinnz, Re "*so even for example encoding such a string to UTF-8 happens to work as expected*", That is most definitely not true. `encode` coincidentally works for iso-latin-1 because iso-latin-1 is a subset of Unicode Code Points. Try it with any other encoding (e.g. cp1252), and it won't work. The two other examples I gave DON'T work for iso-latin-1. – ikegami Jul 17 '19 at 20:56
  • @ikegami That's what I meant: ISO-8859-1 is a subset. It won't work for other single-byte encodings. – Grinnz Jul 17 '19 at 20:56
  • @Grinnz, 1) You didn't mention ISO-8859-1 at all. What you did say was completely wrong. 2) The other two examples I gave still don't work with iso-8859-1. – ikegami Jul 17 '19 at 20:57
  • @ikegami I... did? – Grinnz Jul 17 '19 at 20:59
  • @Grinnz, You mentioned it elsewhere. Not talking about that. Your contradiction to what I said ("Outside of use utf8 it is treated as the native single-byte encoding, not ASCII.") is completely wrong. Perl expects ASCII or UTF-8. It doesn't accept the local encoding as you claim, and it doesn't accept iso-latin-1 as I think you also claim. These will not work. I gave you three tests you can perform to see for yourself. – ikegami Jul 17 '19 at 21:00
  • @ikegami `ů` is not in ISO-8859-1, but `uc("ú")` (in a latin-1 encoded file) does work with `use feature "unicode_strings"`. Without that feature (or the string getting upgraded some other way), only ASCII characters have uppercase analogues. It also can be encoded to UTF-8 successfully in either case. I don't know of a way to apply unicode rules to identifier parsing outside of `use utf8`. – Grinnz Jul 17 '19 at 21:23
  • Re "*I don't know of a way to apply unicode rules to identifier parsing outside of use utf8.*", That makes no sense. There's only one set of rules. You just have to use an encoding that has non-ASCII characters. Like I said, Perl doesn't know anything about iso-latin-1 or any encoding other than ASCII and UTF-8. (Actually, it also understands UTF-16, or at least UTF-16le, but that might be specific to Windows builds.) – ikegami Jul 17 '19 at 21:26
  • @ikegami Incorrect, there are two distinct sets of rules of identifier parsing as listed in [perldata](https://perldoc.pl/perldata#Identifier-parsing). – Grinnz Jul 17 '19 at 21:27
  • Re "*there are two distinct sets of rules of identifier parsing as listed in perldata.*", Prove it. Show me a difference between an ASCII file with `no utf8;` and a UTF-8 file with `use utf8;`. The docs are poorly written if they say what you claim they say. – ikegami Jul 17 '19 at 21:32
  • @ikegami The fact that `sub fête` does not work when encoded to ISO-8859-1 without `use utf8` is the difference. – Grinnz Jul 17 '19 at 21:36
  • Right. Perl doesn't understand iso-latin-1. We covered this. That's not a difference in parsing rules – ikegami Jul 17 '19 at 21:39
  • @ikegami It doesn't. That `uc("ú")` works with the `unicode_strings` feature or if the string gets upgraded is because it's a subset of Unicode code points, not because it's ISO-8859-1. I agree with that. But that's why things happen to work. – Grinnz Jul 17 '19 at 21:40
  • @ikegami Perhaps a better wording is what the mentioned perldata section uses: Perl expects UTF-8 or "ASCII + 128 extra generic characters". And it will assume those bytes map to the corresponding characters when treated like characters. – Grinnz Jul 17 '19 at 21:45
  • @Grinnz, That is correct (for string literals). Perl string literals are "8-bit clean". This is explained in the link in the second comment (my first). – ikegami Jul 17 '19 at 21:46