C++ function that tells if a Unicode code point is a 'letter' and not a number or punctuation

Question

Is there a C++ function available that decides if a given Unicode code point is a letter? I mean what's often described as \p{L} in regular expressions. So it could be a Latin, Greek, Cyrillic or other letter, as opposed to punctuation, numbers, etc., which in Unicode are also represented by several other large code point ranges.

So what I'm asking for is a function similar to this:

bool isUnicodeLetter(int32 codepoint);

Maybe in the boost or ICU libraries?

[This question](http://stackoverflow.com/questions/3378343/isalpha-equivalent-for-wchar-t) appears to be similar. — Mark Wilkins, Aug 24 '12 at 23:25
Functions that deal with Unicode should never ever take a single codepoint as though all characters can be represented as a single codepoint (because not all can be). You need a function like `bool isUnicodeLetter(std::u32string character);`. If you find a function that takes a codepoint then be sure to never use it because it's necessarily wrong. — bames53, Aug 24 '12 at 23:32
@barnes53: This seems to contradict Daniel Trebbien's answer. The ICU library is a highly regarded standard unicode library. Are you saying they got it wrong? — Frank, Aug 25 '12 at 02:23
@Frank I believe that function could be legitimately used as a query on whether the codepoint has a certain Unicode property, however it could easily be misused by someone wanting to classify _characters_. — bames53, Aug 25 '12 at 03:26

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

In ICU4C, the function is called u_isalpha():

UBool u_isalpha(UChar32 c)
Determines whether the specified code point is a letter character.

True for general categories "L" (letters).

But be careful when using this as it is easy to misuse. u_isalpha() and the other functions in uchar.h are only designed to provide low-level access to Unicode character data.

answered Aug 24 '12 at 23:53 by Daniel Trebbien

Note that the name of the function is deceptive, in that it does test whether the code point has the `Alphabetic=Yes` Unicode character property; it only tests whether it has the `General_Category=Letter` property, which itself is any of `General_Category=Lowercase_Letter`, `General_Category=Modifier_Letter`, `General_Category=Other_Letter`, `General_Category=Titlecase_Letter`, `General_Category=Uppercase_Letter`, and nothing else. – tchrist Aug 25 '12 at 01:10
@tchrist: I'm quite happy with that function. Why is the name deceptive? It tests for any letter property. What else would you want? – Frank Aug 25 '12 at 02:25
Well, I would want it to test the Unicode `alpha` property, which is something else. This would only work on the Latin script, because others have non-`Letter` codepoints with the `Other_Alphabetic` property. In some scripts, it is useless to just test for simple letters. Alphabetic is not just letters in Unicode. Sorry. – tchrist Aug 25 '12 at 02:26
Oh, do you have a typo in your first comment, then? "in that it does test whether the code point has the Alphabetic=Yes Unicode character property" Did you want to say "doesn't test"? – Frank Aug 25 '12 at 02:28
How is `u_isalpha` from ICU different from `std::iswalpha`, which I found on the related question? http://stackoverflow.com/questions/3378343/isalpha-equivalent-for-wchar-t?lq=1 – Frank Aug 25 '12 at 02:29
@Frank: Just as the C and POSIX standards do not define the semantics of `isalpha()` beyond the ASCII range, the semantics of `std::iswalpha` is similarly only defined for the current wide character set, as specified by the `LC_CTYPE` category of the current locale. – Daniel Trebbien Aug 25 '12 at 12:49
BTW, in addition to `u_isalpha` I also found `u_isUAlphabetic(UChar32)`, which just checks the binary property `Alphabetic`. – Frank Aug 27 '12 at 18:33
