How to get all Unicode characters from specific categories?

Question

How can I get, for example, a code point pattern such as x-y\uxxxx\Uxxxxxxxx for the Connector Punctuation (Pc) category, for scanning ECMAScript 3/JavaScript identifiers?

Original question

I need help verifying whether a character (code point) is valid in an ECMA-262 (3rd edition, section 7.6) identifier, for a lexical scanner.

Syntax quote

Identifier ::

IdentifierName but not ReservedWord

IdentifierName ::

IdentifierStart

IdentifierName IdentifierPart

IdentifierStart ::

UnicodeLetter

$

_

~~\ UnicodeEscapeSequence~~ # no need to check this

IdentifierPart ::

IdentifierStart

UnicodeCombiningMark

UnicodeDigit

UnicodeConnectorPunctuation

UnicodeLetter ::

any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::

any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::

any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::

any character in the Unicode category “Connector punctuation (Pc)”

As you can see, it takes any character of certain categories.

I need all of these possible characters, so my first step was to search for "Connector punctuation" on this Unicode 5.0 chart, but I got 0 matches and I believe I'm going about it the wrong way. Could someone help me?

score 5 · Accepted Answer · edited Jan 04 '20 at 05:55

Unicode offers this tool for determining sets of characters. It uses regular expressions in which each property=value pair is enclosed in [: and :].

To select all characters in Unicode 5.0, use [:age=5.0:].

The remaining constraints are "general categories" (gc). For example, [:age=5.0:]&[:gc=Lu:] finds all uppercase letters in Unicode 5.0, and gc=L matches letters of every kind (Lu, Ll, Lt, Lm, and Lo).

For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].
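The same two sets can also be approximated directly in JavaScript with Unicode property escapes (ES2018 and later). This is only a sketch: the function names are made up for illustration, and \p{...} follows whatever Unicode version the engine ships with, so there is no equivalent of the [:age=5.0:] restriction here.

```javascript
// Sketch of the ES3 IdentifierStart / IdentifierPart sets using \p{...} escapes.
// Assumption: the engine's bundled Unicode version is acceptable (NOT pinned to 5.0).
const identifierStart = /^[\p{L}\p{Nl}$_]$/u;
const identifierPart = /^[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_]$/u;

// Hypothetical helpers: test a single code point against each set.
function isIdentifierStart(codePoint) {
  return identifierStart.test(String.fromCodePoint(codePoint));
}

function isIdentifierPart(codePoint) {
  return identifierPart.test(String.fromCodePoint(codePoint));
}

// Checks IdentifierName only; the ReservedWord exclusion and
// \u escape sequences are not handled here.
function isIdentifierName(text) {
  const chars = Array.from(text); // iterates by code point, not by UTF-16 unit
  return (
    chars.length > 0 &&
    isIdentifierStart(chars[0].codePointAt(0)) &&
    chars.slice(1).every((c) => isIdentifierPart(c.codePointAt(0)))
  );
}

console.log(isIdentifierName("$foo_1")); // valid start, valid parts
console.log(isIdentifierName("1abc"));   // a digit cannot start an identifier
```

A scanner that must match Unicode 5.0 exactly would still need the character list from the tool above, or from the Unicode data files for that version.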

Unicode also has properties called ID_Start and ID_Continue, but they do not cover exactly the same characters as your specification.
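To make the difference concrete, here is a small comparison in JavaScript, which exposes both properties through \p{ID_Start} and \p{ID_Continue}. The compare helper and the sample characters are only illustrative, and the results reflect the engine's Unicode version rather than 5.0.

```javascript
// ES3-style sets built from general categories, as in the answer above.
const es3Start = /^[\p{L}\p{Nl}$_]$/u;
const es3Part = /^[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_]$/u;

// The Unicode-defined identifier properties.
const idStart = /^\p{ID_Start}$/u;
const idContinue = /^\p{ID_Continue}$/u;

// Hypothetical helper: report membership of one character in all four sets.
function compare(ch) {
  return {
    es3Start: es3Start.test(ch),
    idStart: idStart.test(ch),
    es3Part: es3Part.test(ch),
    idContinue: idContinue.test(ch),
  };
}

// "$" and "_" are allowed by the spec but are not ID_Start;
// U+2118 and U+00B7 are extra characters that Unicode adds to ID_Start / ID_Continue.
for (const ch of ["$", "_", "\u2118", "\u00B7"]) {
  console.log(JSON.stringify(ch), compare(ch));
}
```

So "$" has to be added by hand in either approach, while characters such as U+2118 (general category Sm) would be accepted by ID_Start even though the spec's category list excludes them.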

Here is also an overview of all Unicode character properties.

score 0 · Answer 2 · answered Sep 28 '22 at 11:19

I'm the OP. I now use a different approach for determining the Unicode General Category. I made a tool that converts the UnicodeData.txt file into highly optimized binaries: https://github.com/matheusdiasdesouzads/unicode-general-category/tree/master/data and I also wrote a library for working with General Categories: https://github.com/matheusdiasdesouzads/unicode-general-category/tree/master/language-specific/javascript-nodejs

// GeneralCategory is provided by the library linked above.
const cat = GeneralCategory.from(0x41); // U+0041 LATIN CAPITAL LETTER A
cat.toString(); // 'Lu'
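If what you want is the literal range pattern from the question rather than a lookup, UnicodeData.txt can also be parsed directly: each line is a semicolon-separated record whose first field is the code point in hex and whose third field is the general category, and large blocks are given as a pair of "<..., First>" and "<..., Last>" lines. The following is a minimal sketch, not part of the library above; rangesForCategory, toCharacterClassBody, and the inline sample are hypothetical names and data for illustration, and in practice the text would be read from the UnicodeData.txt of the Unicode version you target.

```javascript
// Collect [first, last] code point ranges for one general category.
// Assumes the input is sorted by code point, as UnicodeData.txt is.
function rangesForCategory(unicodeDataText, category) {
  const ranges = [];
  let pendingFirst = null;
  for (const line of unicodeDataText.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const fields = line.split(";");
    const codePoint = parseInt(fields[0], 16);
    const name = fields[1];
    const gc = fields[2];
    // "<..., First>" opens a block that the next "<..., Last>" line closes.
    if (name.endsWith(", First>")) {
      pendingFirst = codePoint;
      continue;
    }
    let start = codePoint;
    if (name.endsWith(", Last>") && pendingFirst !== null) {
      start = pendingFirst;
      pendingFirst = null;
    }
    if (gc !== category) continue;
    const last = ranges[ranges.length - 1];
    if (last && last[1] + 1 === start) {
      last[1] = codePoint; // extend the previous range
    } else {
      ranges.push([start, codePoint]);
    }
  }
  return ranges;
}

// Format one code point as \uXXXX, or \u{XXXXX} above the BMP (needs the u flag).
function escapeCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase();
  return cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : "\\u{" + hex + "}";
}

// Turn ranges into the body of a regular expression character class.
function toCharacterClassBody(ranges) {
  return ranges
    .map(([lo, hi]) =>
      lo === hi ? escapeCodePoint(lo) : escapeCodePoint(lo) + "-" + escapeCodePoint(hi)
    )
    .join("");
}

// Tiny hand-copied excerpt in UnicodeData.txt format, for demonstration only.
const sample = [
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
  "005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;",
  "203F;UNDERTIE;Pc;0;ON;;;;;N;;;;;",
  "2040;CHARACTER TIE;Pc;0;ON;;;;;N;;;;;",
].join("\n");

const body = toCharacterClassBody(rangesForCategory(sample, "Pc"));
console.log(body);
const pcPattern = new RegExp("^[" + body + "]$", "u");
console.log(pcPattern.test("_"));
```

The same two functions can be run once per category (Lu, Ll, Lt, Lm, Lo, Nl, Mn, Mc, Nd, Pc) and the resulting bodies concatenated into one character class for the scanner.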
