As the standard character classification functions from <cctype> and <locale> cannot handle multi-byte character encodings like UTF-8, one has to resort to other implementations. A suitable library might be Boost.Locale (probably with ICU as its backend). Unfortunately, I could not find out how to iterate over a UTF-8 encoded string code point by code point or glyph by glyph and classify each one as e.g. upper- or lowercase, whitespace, etc.
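To illustrate the limitation, here is a minimal sketch using only the standard library. The helper name count_upper_bytes is made up for this example. It assumes the default "C" locale, where only A-Z count as uppercase, so the two bytes that encode "Ä" in UTF-8 are never recognised as an uppercase letter:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only.
// Counts the bytes that std::isupper reports as uppercase.
// std::isupper sees one byte at a time, so it cannot classify a
// character that UTF-8 encodes as several bytes.
std::size_t count_upper_bytes(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {   // unsigned char avoids UB for bytes >= 0x80
        if (std::isupper(byte)) {
            ++count;
        }
    }
    return count;
}
```

Calling it with "\xC3\x84" (the UTF-8 encoding of "Ä") should find no uppercase byte in the "C" locale, although the string consists of exactly one uppercase letter.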
There have been similar questions without satisfactory answers:
- Character classification
- Boost.Locale and isprint
- Why boost locale didn't provide character level rule type?
There are, however, low-level functions in ICU as suggested by other answers:
- C++ function that tells if a unicode point is a 'letter' and not number of punctuation
- Why boost locale didn't provide character level rule type?
Q1: Given the easy-sounding task of iterating over a UTF-8 encoded string and classifying each character as upper- or lowercase or whitespace, how would one implement it in C++ with Boost.Locale?
Q2: If Boost.Locale is not capable of doing it but ICU is, how would one use Boost.Locale to get a suitable value to pass to ICU's classification functions? ICU usually takes an int32_t. How do I get this from a UTF-8 string via Boost.Locale?
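To make Q2 concrete, this is the kind of value I mean. The sketch below is a hand-rolled decoder using only the standard library, not a Boost.Locale or ICU function. The name decode_utf8 and the choice to emit U+FFFD for malformed input are assumptions made for this example:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only.
// Decodes a UTF-8 string into code points stored as int32_t.
// Malformed, overlong, surrogate, or out-of-range sequences yield U+FFFD.
std::vector<std::int32_t> decode_utf8(const std::string& utf8)
{
    const std::int32_t replacement = 0xFFFD;
    // Smallest code point that legitimately needs 0..3 trailing bytes.
    const std::int32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    std::vector<std::int32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        std::int32_t cp = 0;
        int trailing = 0;
        if (lead < 0x80)                       { cp = lead;        trailing = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0)        { cp = lead & 0x0F; trailing = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; trailing = 3; }
        else { out.push_back(replacement); continue; }  // stray or invalid lead byte

        bool ok = true;
        for (int k = 0; k < trailing; ++k) {
            if (i >= utf8.size()) { ok = false; break; }         // truncated input
            const unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        if (!ok || cp < min_cp[trailing] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = replacement;
        }
        out.push_back(cp);
    }
    return out;
}
```

Each element of the returned vector is one code point, which is the sort of argument I would expect a per-character classification function to take.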
Q3: Boost.Locale's functions that operate on UTF-8 strings usually also take a locale as a parameter. How do I pass that parameter if I don't know which language a string contains? E.g. a string could contain English or Chinese text independently of the locale. Doesn't Unicode define properties like WSpace independently of any locale? So does it matter which locale I provide, as long as it is a UTF-8 locale?
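As an illustration of what I mean by a locale-independent property, here is a sketch that takes only a code point and no locale. The function name is made up, and the list of code points is hard-coded from my reading of the Unicode White_Space property, so it should be checked against the current Unicode data files before being relied on:

```cpp
#include <cstdint>

// Hypothetical helper for illustration only.
// Returns true if `cp` is in the hard-coded White_Space list below.
// The result depends only on the code point, never on a locale.
bool has_white_space_property(std::int32_t cp)
{
    return (cp >= 0x0009 && cp <= 0x000D)   // TAB, LF, VT, FF, CR
        || cp == 0x0020                     // SPACE
        || cp == 0x0085                     // NEXT LINE
        || cp == 0x00A0                     // NO-BREAK SPACE
        || cp == 0x1680                     // OGHAM SPACE MARK
        || (cp >= 0x2000 && cp <= 0x200A)   // EN QUAD .. HAIR SPACE
        || cp == 0x2028                     // LINE SEPARATOR
        || cp == 0x2029                     // PARAGRAPH SEPARATOR
        || cp == 0x202F                     // NARROW NO-BREAK SPACE
        || cp == 0x205F                     // MEDIUM MATHEMATICAL SPACE
        || cp == 0x3000;                    // IDEOGRAPHIC SPACE
}
```

With this sketch, U+3000 (IDEOGRAPHIC SPACE, common in Chinese text) and U+0020 are both whitespace regardless of any locale, which is the behaviour I would hope to get from a library.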
Target platform is Windows. Compiler is Visual Studio 2015.