Grammar and unicode characters

Question

Why does the grammar below fail to parse Unicode characters?

It parses fine once the word boundaries (« and ») around <.sym> are removed.

#!/usr/bin/env perl6

grammar G {
  proto rule TOP { * }

  rule TOP:sym<y>  { «<.sym>» }
  rule TOP:sym<✓>  { «<.sym>» }
}

say G.parse('y'); # ｢y｣
say G.parse('✓'); # Nil

raiph · Answer 1 · 2019-08-18T11:21:04.677

8

From the « and » "left and right word boundary" doc:

[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.

✓ isn't a word character. So the word boundary assertion fails.
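A quick way to check this, as a minimal sketch. The helper names `is-word-char` and `has-left-boundary` are made up for illustration and are not part of the original post:

```raku
# Hypothetical helper: is this single character matched by \w?
sub is-word-char(Str $c --> Bool) { so $c ~~ /^ \w $/ }

# Hypothetical helper: does the left word boundary « match at the start of the string?
sub has-left-boundary(Str $s --> Bool) { so $s ~~ /^ « / }

for 'y', '✓' -> $c {
    say $c, ' word char: ', is-word-char($c), '; left boundary: ', has-left-boundary($c);
}
```

Both checks should succeed for y and fail for ✓, which is why the « in the grammar rejects ✓ before <.sym> is even tried.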

What is and isn't a "word character"

"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:

Characters whose Unicode general category starts with an L, which stands for Letter.¹
Characters whose Unicode general category is Nd, which stands for Number, decimal.²
_, an underscore.
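The three cases above can be spelled out as an explicit regex and compared with \w. This is only a sketch: the helper names `is-alpha-nd-under` and `matches-w` and the sample characters are my own choices for illustration.

```raku
# Hypothetical helper: the three cases listed above, written as alternatives.
sub is-alpha-nd-under(Str $c --> Bool) {
    so $c ~~ /^ [ <:L> | <:Nd> | '_' ] $/
}

# Hypothetical helper: what the built-in \w class says about the same character.
sub matches-w(Str $c --> Bool) { so $c ~~ /^ \w $/ }

# Samples: Ll, Lu, Lo, Nd, underscore, then two symbols that are not word characters.
for 'y', 'Y', 'ら', '१', '_', '✓', '①' -> $c {
    say $c, ' ', $c.uniprop, ' explicit: ', is-alpha-nd-under($c), ' word: ', matches-w($c);
}
```

If the definition above is right, the last two columns should agree on every line.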

"alpha 'Nd under"

In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".

But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).²

This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".

Footnotes

¹ Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.

² Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
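To inspect the digit examples from footnote ², a minimal sketch (the helper name `is-nd` is hypothetical) prints each character's name and general category and tests it against /<:Nd>/:

```raku
# Hypothetical helper: is this single character in general category Nd?
sub is-nd(Str $c --> Bool) { so $c ~~ /^ <:Nd> $/ }

# Explicit strings rather than <...> quote-words, so each element is a plain Str.
for '1', '¹', '①', '१', '一', '६', '六' -> $c {
    say $c, ' ', $c.uniname, ' ', $c.uniprop, ' Nd: ', is-nd($c);
}
```

The list is written with quoted strings on purpose: quote-words such as <1 १> can produce allomorphs like IntStr, which may change how the Unicode methods treat the value.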

edited Aug 18 '19 at 11:21

answered Aug 16 '19 at 08:30

raiph

Thanks @raiph. I misunderstood what `«` actually does, but now I understand. – hythm Aug 17 '19 at 06:46
What I'm trying to do is match `<sym>` (any printable character), on the condition that `<sym>` is either the whole text, or between two spaces, or between the start of the text and a following space, or between a preceding space and the end of the text (and the match should not consume the spaces). Is there a good way to do that? – hythm Aug 17 '19 at 07:01
"What I'm trying to do is match `<sym>` (any printable character)" Hmm. First, [what [are the\] Unicode Printable Characters?](https://stackoverflow.com/questions/3770117/what-is-the-range-of-unicode-printable-characters) Even if you assume someone has the right font (maybe you plan to assume the latest [noto font](https://fonts.google.com/specimen/Noto+Sans)?), whether or not a character is *actually* printable can only be known by trying to print it and seeing what gets rendered. But perhaps by "printable" you mean "*probably* doesn't *render* on *most* systems as *invisible* whitespace"? – raiph Aug 17 '19 at 10:11
If by `<sym>` you **do** mean "probably doesn't render on most systems as invisible whitespace" then that suggests the thing to do is match "not [whitespace](https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace)". – raiph Aug 17 '19 at 10:42
If by `<sym>` you mean "**definitely** does *not* render with an `�` in it using *this font* on *this system*" then I'm thinking that **A** You'd have to write code that trial renders the character and diagnoses the result and **B** It would take extensive engineering, leaning heavily on caching, and testing, evolution, and documentation, multiplied for each Rakudo backend on which it's implemented, if it's to have any chance of being anything other than absurdly slow and complex to use in a regex. Are you ready for a heroic N year journey with no guarantee of useful success at the end of it? – raiph Aug 17 '19 at 10:59
Right. I meant "probably doesn't render on most systems as invisible whitespace". I will match "not whitespaces" and see how this works in my `Grammar`. Thanks. – hythm Aug 17 '19 at 16:20
Makes sense. I'd guess that `\s+` matches one or more characters with the Unicode property setting "WSpace=yes" per the link above but don't quote me on that. [The `\s` doc](https://docs.perl6.org/language/regexes#index-entry-regex_%5Cs-regex_%5CS-%5Cs_and_%5CS) isn't clear and I'm too tired right now to check further. N.B. Don't use the default `<ws>` unless you *also* want a mere word boundary to count as if it were whitespace. – raiph Aug 17 '19 at 17:36
a mnemonic for me is modern day alphanumunder, then – librasteve Aug 18 '19 at 08:11
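
For the follow-up in the comments above (match a symbol only when it stands alone, delimited by whitespace or by the ends of the text, without consuming the whitespace), one possible approach is a pair of negative lookarounds. This is a sketch rather than a tested recommendation from the thread; the sub name `standalone` and its parameters are hypothetical.

```raku
# Hypothetical helper: does $sym occur in $text with no non-whitespace
# character directly before or after it?
sub standalone(Str $text, Str $sym --> Bool) {
    so $text ~~ /
        <!after \S>    # start of text, or whitespace just before (not consumed)
        "$sym"         # the symbol itself, matched literally
        <!before \S>   # end of text, or whitespace just after (not consumed)
    /
}

for '✓', 'a ✓ b', '✓ b', 'a ✓', 'a✓ b', '✓b' -> $text {
    say "'$text' ", standalone($text, '✓');
}
```

Because the lookarounds are zero-width, the surrounding spaces stay outside the match. To accept any single non-whitespace character instead of one fixed symbol, `"$sym"` could be swapped for `\S`.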

score 4 · Accepted Answer · edited Aug 18 '19 at 21:02

4

I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:

for <y ✓ Ⅲ> {
    say $_.uniprops;
    say m/<|w>/;
}

The first line of the loop prints each character's Unicode general category; the second tests the character against the word boundary anchor <|w>. Only the first character, y, is a letter (Ll), can be part of an actual word, and so matches that anchor; ✓ (So) and Ⅲ (Nl) are not word characters and do not match. You can use any letter (a general category starting with L), any Nd decimal digit, or an underscore as part of a word, and in your grammar; only such characters can actually form words.

grammar G {
  proto rule TOP { * }

  rule TOP:sym<y>  { «<.sym>» }
  rule TOP:sym<ら>  { «<.sym>» }
}

say G.parse('y'); # ｢y｣
say G.parse('ら'); # This is a hiragana letter, so it works.
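
As the question notes, the ✓ alternative parses once its word boundaries are removed. A minimal sketch of that variant follows; `G2` is a hypothetical name, and keeping the boundaries on the y alternative is my own choice for illustration.

```raku
# G2 is a hypothetical name for a variant of the grammar above.
grammar G2 {
    proto rule TOP { * }

    # y is a word character, so the « » word boundaries can match around it.
    rule TOP:sym<y> { «<.sym>» }

    # ✓ is not a word character, so this alternative omits the boundaries.
    rule TOP:sym<✓> { <.sym> }
}

say G2.parse('y');
say G2.parse('✓');
```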

edited Aug 18 '19 at 21:02

raiph

answered Aug 16 '19 at 08:43

jjmerelo

"You can use any `Ll` character as part of a word, and in your grammar, but only characters with that Unicode property can actually form words." That's sufficiently far from correct that I feel the need to comment. See my answer for my attempt at defining **What is and isn't a "word character"**. (My original answer was super curt because I suddenly had to go do something for the day so I just published what I had at the time. Now it's not only suitably enriched but also has a bonus @p6steve inspired mnemonic power up.) – raiph Aug 18 '19 at 11:25
OK, I'll check it and will edit accordingly. That's the gist of it anyway. – jjmerelo Aug 18 '19 at 19:43
