
Why does this print a U and not a Ü?

#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

while ( $string =~ /(\X)/g ) {
        say $1;
}

# Output: U
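A way to check what the loop actually matches, independent of what the terminal renders, is to dump the codepoints of each match instead of the glyphs; a minimal sketch along the same lines as the script above:

```perl
#!/usr/bin/env perl
use warnings;
use 5.014;
use charnames qw(:full);

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

while ( $string =~ /(\X)/g ) {
    # Print each codepoint of the match rather than the glyph itself.
    say join ' ', map { sprintf 'U+%04X', ord } split //, $1;
}
# U+0055 U+0308
```

A single line containing both codepoints means `\X` matched the whole grapheme cluster, so the bare `U` is a rendering problem, not a matching one.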
sid_com
  • It [works](http://ideone.com/tw9Qr) on perl 5.12. – jfs Feb 24 '12 at 10:47
  • You need to play these things by the numbers; don’t trust what a "terminal" displays. Pipe it through [uniquote](http://training.perl.com/scripts/uniquote), probably with `-x` or `-v`, and see what it is really doing. Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you. – tchrist Feb 24 '12 at 11:54
  • I was reading the manual and remembered this question; more about `\X` here: `perldoc perlrebackslash`. – k-mx Oct 22 '19 at 16:34

4 Answers


Your code is correct.

You really do need to play these things by the numbers; don’t trust what a "terminal" displays. Pipe it through the uniquote program, probably with -x or -v, and see what it is really doing.

Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you. Normalization shouldn’t matter.

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
cr\x{E8}me br\x{FB}l\x{E9}e
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' 
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
cre\x{300}me bru\x{302}le\x{301}e

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' 
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
\x{E9}el\x{302}urb em\x{300}erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
e\x{301}el\x{302}urb em\x{300}erc
tchrist
  • I concur. No changes are required to the code. It's an issue with the OP's terminal (and mine too, Debian's KDE's `konsole`). – ikegami Feb 24 '12 at 17:01

This works for me, though I have an older version of perl, 5.012, on Ubuntu. My only change to your script is `use 5.012;`.

$ perl so.pl 
Ü
beerbajay

May I suggest it's the output which is incorrect? It's easy to check: replace your loop code with:

my $counter;
while ( $string =~ /(\X)/g ) {
  say ++$counter, ': ', $1;
}

... and look up how many times the regex matches. My guess is that it will still match only once.

Alternatively, you can use this code:

use Encode;
sub codepoint_hex {
    sprintf "%04x", ord Encode::decode("UTF-8", shift);
}

... and then print `codepoint_hex($1)` instead of plain `$1` within the while loop.
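Since the script uses `use utf8`, `$1` already holds decoded characters, so the `Encode::decode` round-trip is unnecessary, and `ord` alone reports only the first codepoint of the match (both points are raised in the comments below). A sketch of a variant that shows every codepoint:

```perl
sub codepoint_hex {
    # The argument is already a decoded string; just walk its codepoints.
    join ' ', map { sprintf '%04x', ord } split //, shift;
}
```

With this, `say codepoint_hex($1)` prints `0055 0308` for the single grapheme-cluster match.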

raina77ow
  • tchrist, stop teaching that advice! Preferring implicit encoding over explicit with the Encode library produces buggy code at best and insecure at the worst. Assume `-Mstrictures -Mautodie=:all` w/examples. `perl -CD -E'open my $fh, "<", "broken-utf8"; my $foo = <$fh>; say "survived"'␤perl -E'open my $fh, "<:encoding(UTF-8)", "broken-utf8"; my $foo = <$fh>; say "survived"'␤perl -M'open=:encoding(UTF-8)' -E'open my $fh, "<", "broken-utf8"; my $foo = <$fh>; say "survived"'␤perl -MEncode=decode -E'open my $fh, "<", "broken-utf8"; my $foo = decode "UTF-8", <$fh>, Encode::FB_CROAK; say "survived"'` – daxim Feb 24 '12 at 15:52
  • @tchrist, There are many reasons to use `decode` and `encode`. Many of us get input elsewhere than text file handles. I do agree that `decode` makes no sense here, though (since the match requires decoded text to work in the first place). Should be `sprintf "%04x", ord shift`. – ikegami Feb 24 '12 at 17:05
  • @daxim I can’t make any sense out of that. The point is that streams that are all in the same encoding should never need manual encode/decode. – tchrist Feb 24 '12 at 17:15
  • @ikegami Yes, you’re right: there are. They just don’t include streams that are all in the same encoding, which very nearly every single one of them is. Databases and environment variables, plus program arguments, are places where you often need to deal with encode/decode. I have seen too many programs using them on streams inappropriately, and so I have come to see doing so as an antipattern. – tchrist Feb 24 '12 at 17:17
  • @daxim Note that you should not be telling people to use `autodie`, because it’s broken. And those of us who run with `use warnings FATAL => "utf8"` and `use open ...` have no such troubles. – tchrist Feb 24 '12 at 17:18
  • Should and would in a perfect world, but Perl's not perfect. The implicit decoding facilities (`-C` switch, `use open` pragma, `open()` with layers) do not throw exceptions, even with fatalised warnings in effect (pragma `strictures` does that if you didn't recognise it from above). `perldoc PerlIO::encoding` indicates that adding `$PerlIO::encoding::fallback = Encode::FB_CROAK` should make them fatal, but it actually doesn't help. (Now that 5.16 is code-freezed, we probably have to wait a year for a fix for all this mess.) Currently *only* the Encode library DTRT. – daxim Feb 24 '12 at 17:35
  • @daxim That does not reflect my tests: `perl -C0 -E 'say for "caf\xE9", "stuff"' | perl -CS -Mwarnings=FATAL,utf8 -pe 'print "$. "'` certainly throws an exception. I **will not** tell folks to use nothing save the clunkiest, most error-prone, redundant, & confusing of all possible approaches, especially if this is just some bug workaround. I **will** continue to tell them to use things as though these worked right—because in the long run, they shall. That means `perl -C`, `PERL_UNICODE`, & I/O layers like `":utf8"`. Either ① you're wrong or ② there should be a release-blocking security bug filed. – tchrist Feb 25 '12 at 20:57
  • @daxim There is scant reason for to use the `Encode` module on streams. Some people have other reasons—like databases—but most don’t. If there is some problem with builtin Unicode handling that is a security risk, then stop the presses and halt the release with a security bug. If you won’t do that, there’s no problem worthy of all this rigamarole&folderol. Perl is supposed to make easy thing easy: we ***shall*** not suffer years of injury against clear code. That’s what you espouse, and it is the wrong way to go. If there’s a bug, then kindly file it and halt the release; otherwise move along. – tchrist Feb 25 '12 at 21:02

1) Apparently, your terminal can't display extended characters. On my terminal, it prints:

2) \X doesn't do what you think it does. It matches an extended grapheme cluster: a base character together with any combining marks that follow it. If you use the string "fu\N{COMBINING DIAERESIS}r", your program displays:

f
u¨
r

Note how the diacritic mark isn't printed alone but with its corresponding character.

3) To combine all related characters into one, use the module Unicode::Normalize:

use Unicode::Normalize;

my $string = "fu\N{COMBINING DIAERESIS}r";
$string = NFC($string);

while ( $string =~ /(\X)/g ) {
    say $1;
}

It displays:

f
ü
r
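A caveat raised in the comments below: NFC yields a single codepoint only when a precomposed character exists, and for most base+mark combinations none does. A small sketch:

```perl
use 5.014;
use Unicode::Normalize qw(NFC);

say length NFC("u\x{0308}");   # 1 -- precomposed U+00FC (ü) exists
say length NFC("n\x{0308}");   # 2 -- no precomposed n-with-diaeresis, so the
                               #      cluster stays two codepoints
```

So NFC makes this particular script print ü as one character, but it cannot do the same for arbitrary base+mark clusters.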
Stamm
  • **FIRST:** That is not what NFC does. It just happens to do so here. It does many other things; people are mistaken about its general use and purpose. **SECOND:** If your terminal program won't display combining characters correctly, it is treating canonically equivalent sequences differently, which is evil and wrong. See Conformance Requirement C6 on p.60 of the Unicode Standard. Yours is buggy: you shouldn’t need to diddle it, else you can’t write: `perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD("crème brûlée")'` => `éel̂urb em̀erc`. – tchrist Feb 24 '12 at 11:46
  • Notice that running NFC on "éel̂urb em̀erc" will not "combine all related characters into one". – tchrist Feb 24 '12 at 11:52
  • What do you think he thinks `\X` does? – tchrist Feb 24 '12 at 12:03
  • @tchrist **I.** Yes, you're right. My terminal is as buggy as the OP's one but for different reasons. It should combine diacritics itself. But I believe normalization can be used to display extended characters on not-fully-unicode-compliant terminals. **II.** I'm not sure I really understand your point on NFC. Accents will be all wrong because characters are reversed. No surprise here. **III.** `\X` matches a character and all its subsequent diacritic marks. Am I wrong? – Stamm Feb 24 '12 at 13:18
  • I don’t know what “extended characters” are. Characters with extenders instead of descenders? My point about NFC is that its main job is to render diacritics in a predictable ordering: hence *canonical*. It just so happens that with **a scant few** of them, it elects a precomposed character. Yes, it does so in the *ü* case. But there are only a few compat glyphs, and there are infinite graphemes. If I have an underline, a macron, and a tilde on a base letter, it can’t combine those three marks into one precomposed codepoint, because there is no such thing. Normalization also kills singletons, BTW. – tchrist Feb 24 '12 at 13:26