Is there a C++ function that decides whether a given Unicode code point is a letter? I mean what is often written as \p{L} in regular expressions. So it could be a Latin, Greek, Cyrillic or other letter, as opposed to punctuation, numbers, etc., which in Unicode are also represented by several other large code point ranges.

So what I'm asking for is a function similar to this:

bool isUnicodeLetter(int32 codepoint);

Maybe in the boost or ICU libraries?

Frank
  • [This question](http://stackoverflow.com/questions/3378343/isalpha-equivalent-for-wchar-t) appears to be similar. – Mark Wilkins Aug 24 '12 at 23:25
  • Functions that deal with Unicode should never ever take a single codepoint as though all characters can be represented as a single codepoint (because not all can be). You need a function like `bool isUnicodeLetter(std::u32string character);`. If you find a function that takes a codepoint then be sure to never use it because it's necessarily wrong. – bames53 Aug 24 '12 at 23:32
  • @bames53: This seems to contradict Daniel Trebbien's answer. The ICU library is a highly regarded standard Unicode library. Are you saying they got it wrong? – Frank Aug 25 '12 at 02:23
  • @Frank I believe that function could be legitimately used as a query on whether the codepoint has a certain Unicode property, however it could easily be misused by someone wanting to classify _characters_. – bames53 Aug 25 '12 at 03:26

1 Answer

In ICU4C, the function is called u_isalpha():

UBool u_isalpha(UChar32 c)

Determines whether the specified code point is a letter character.

True for general categories "L" (letters).

But be careful when using this as it is easy to misuse. u_isalpha() and the other functions in uchar.h are only designed to provide low-level access to Unicode character data.

Daniel Trebbien
  • Note that the name of the function is deceptive, in that it does test whether the code point has the `Alphabetic=Yes` Unicode character property; it only tests whether it has the `General_Category=Letter` property, which itself is any of `General_Category=Lowercase_Letter`, `General_Category=Modifier_Letter`, `General_Category=Other_Letter`, `General_Category=Titlecase_Letter`, `General_Category=Uppercase_Letter`, and nothing else. – tchrist Aug 25 '12 at 01:10
  • @tchrist: I'm quite happy with that function. Why is the name deceptive? It tests for any letter property. What else would you want? – Frank Aug 25 '12 at 02:25
  • Well, I would want it to test the Unicode `alpha` property, which is something else. This would only work on the Latin script, because others have non-`Letter` codepoints with the `Other_Alphabetic` property. In some scripts, it is useless to just test for simple letters. Alphabetic is not just letters in Unicode. Sorry. – tchrist Aug 25 '12 at 02:26
  • Oh, do you have a typo in your first comment, then? "in that it does test whether the code point has the Alphabetic=Yes Unicode character property" Did you want to say "doesn't test"? – Frank Aug 25 '12 at 02:28
  • How is `u_isalpha` from ICU different from `std::iswalpha`, which I found on the related question? http://stackoverflow.com/questions/3378343/isalpha-equivalent-for-wchar-t?lq=1 – Frank Aug 25 '12 at 02:29
  • @Frank: Just as the C and POSIX standards do not define the semantics of `isalpha()` beyond the ASCII range, the semantics of `std::iswalpha` is similarly only defined for the current wide character set, as specified by the `LC_CTYPE` category of the current locale. – Daniel Trebbien Aug 25 '12 at 12:49
  • BTW, in addition to `u_isalpha` I also found `u_isUAlphabetic(UChar32)`, which just checks the binary property `Alphabetic`. – Frank Aug 27 '12 at 18:33