Are there any equivalents of the character classification functions (isspace, isalpha, etc.) declared in <cctype> that work with char32_t?

I had a look around and could only find iswspace (and related functions), which seem to be meant for 16-bit characters.

Note: while isspace takes an int as its parameter, it seems to produce erroneous results for Unicode characters.

Example:

char32_t dagger = U'\U0001F5E1';  // code point 128481, the dagger emoji
if (isspace(dagger)) {
    puts("That is a space!");
}

This outputs "That is a space!".

Jonathan Leffler
user3818491
  • The `isalpha()` set of functions take an `int` parameter - so, if your compiler uses 32-bit integers as default, then a `char32_t` argument shouldn't present a problem. – Adrian Mole Feb 22 '20 at 16:01
  • It seems to fail for Unicode characters. For example it returns true when given 128481 (which I think is a dagger emoji). – user3818491 Feb 22 '20 at 16:07
  • Posting an example showing that – user3818491 Feb 22 '20 at 16:12
  • Actually, just after I posted my comment, it occurred to me that simply interpreting a `char32_t` as an `int` would probably not work. – Adrian Mole Feb 22 '20 at 16:13

2 Answers

For code points that fit in wchar_t, you can use std::isalpha with a suitable locale; this overload is declared in <locale>.
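A minimal sketch of that approach, assuming a hypothetical helper named is_alpha_in_locale (not a standard function). It passes the code point to the locale-aware std::isalpha only when the value fits in this platform's wchar_t, and reports false otherwise. The classic "C" locale is used here because it always exists; a named locale such as std::locale("en_US.UTF-8") is an assumption about the platform, and constructing it throws std::runtime_error if it is not installed.

```cpp
#include <cassert>
#include <limits>
#include <locale>

// Hypothetical helper (not part of the standard library): classify a code
// point with the locale-aware std::isalpha, but only if it fits in wchar_t.
bool is_alpha_in_locale(char32_t cp, const std::locale& loc) {
    // wchar_t is 16 bits on some platforms and 32 bits on others.
    constexpr auto max_wchar =
        static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max());
    if (cp > max_wchar) {
        return false;  // not representable in this platform's wchar_t
    }
    return std::isalpha(static_cast<wchar_t>(cp), loc);
}

// Usage sketch with the always-available classic ("C") locale.
void demo() {
    const std::locale& c_locale = std::locale::classic();
    assert(is_alpha_in_locale(U'a', c_locale));
    assert(!is_alpha_in_locale(U'1', c_locale));
}
```

How non-ASCII characters are classified depends on the locale passed in, so results for them can differ between platforms.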

For anything above 0xFFFF you will need the ICU library:

u_isalpha or u_isUAlphabetic

u_isspace or u_isUWhiteSpace

Full list of functions: ICU's unicode/uchar.h (not to be confused with the standard <uchar.h> header)

5andr0
While C++-the-language has facilities for generating Unicode values, C++-the-library is almost completely deaf to Unicode. <ctype.h> and <cctype> have no idea how to handle Unicode values; their functionality is based on the C locale mechanism. Your implementation may provide locales that know what Unicode is, but the "C" locale that is the default is not one of them.

Nicol Bolas
  • So is it best just to create my own list of space characters? – user3818491 Feb 22 '20 at 16:37
  • @user3818491: The "best" thing would be to use a Unicode library like ICU. – Nicol Bolas Feb 22 '20 at 16:57
  • Yeah, but I don't really want something so big for such a little thing. – user3818491 Feb 22 '20 at 17:07
  • @user3818491: If you care enough about Unicode to want to accurately ask whether a codepoint counts as whitespace, then you will likely *eventually* want to ask other Unicode questions. Better to go with a tool that you might not fully utilize than to have to create ad-hoc solutions that may not be correct. – Nicol Bolas Feb 22 '20 at 17:26
  • Fair point, I'm already using utf8-cpp for basic parsing but maybe I do need something more. – user3818491 Feb 22 '20 at 17:58
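
For readers weighing the hand-rolled list discussed in these comments, here is a minimal sketch of what it could look like. The function name is_unicode_space is hypothetical, and the code points are the ones carrying the Unicode White_Space property as listed in the Unicode Character Database (PropList.txt) to the best of my knowledge. The list has to be maintained by hand if the standard changes, which is the trade-off against ICU raised above.

```cpp
// Hypothetical helper: true if cp is one of the code points with the
// Unicode White_Space property (assumed list, maintained by hand).
constexpr bool is_unicode_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D)   // tab, LF, VT, FF, CR
        || cp == 0x0020                     // space
        || cp == 0x0085                     // next line
        || cp == 0x00A0                     // no-break space
        || cp == 0x1680                     // Ogham space mark
        || (cp >= 0x2000 && cp <= 0x200A)   // en quad .. hair space
        || cp == 0x2028 || cp == 0x2029     // line / paragraph separator
        || cp == 0x202F                     // narrow no-break space
        || cp == 0x205F                     // medium mathematical space
        || cp == 0x3000;                    // ideographic space
}

// Compile-time checks, including the dagger example from the question.
static_assert(is_unicode_space(U' '), "ASCII space is whitespace");
static_assert(is_unicode_space(0x3000), "ideographic space is whitespace");
static_assert(!is_unicode_space(U'\U0001F5E1'), "the dagger emoji is not");
```

Because the function takes a char32_t and never calls into <cctype>, values outside the unsigned char range are handled without relying on isspace.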