
As the standard character classification functions from <cctype> and <locale> cannot handle multi-byte character encodings like UTF-8, one has to resort to other implementations. A suitable library might be Boost.Locale (probably with ICU as its backend). Unfortunately, I could not find out how to iterate over a UTF-8 encoded string code point by code point or glyph by glyph and classify each one as, e.g., uppercase, lowercase, or whitespace.

There have been similar questions without satisfying answers:

There are, however, low-level functions in ICU as suggested by other answers:

Q1: Take the easy-sounding task of iterating over a UTF-8 encoded string and classifying each character as uppercase, lowercase, or whitespace. How would one implement it in C++ with Boost.Locale?

Q2: If Boost.Locale cannot do it but ICU can, how would one use Boost.Locale to get a suitable value to pass to ICU's classification functions? ICU usually takes an int32_t. How do I get this from a UTF-8 string via Boost.Locale?

Q3: Boost.Locale's functions that operate on UTF-8 strings usually also take a locale as a parameter. How do I pass that parameter if I don't know which language a string contains? E.g. a string could contain English or Chinese text independently of the locale. Doesn't UTF-8 define properties like WSpace independently of any locale? So does it matter which locale I provide, as long as it is a UTF-8 locale?

Target platform is Windows. Compiler is Visual Studio 2015.

sigy

1 Answer


A locale covers many localization aspects, not only encoding, e.g. date/time formats and numeric presentation.

A1. Why should a locale provide character classification? Can you classify Chinese/Japanese characters?

A2. Not sure what you are asking; you can call ICU directly.
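To make the "call ICU directly" route concrete: the int32_t that ICU's classification functions expect is just the Unicode code point, so the only missing piece is a UTF-8 decoder. Below is a minimal hand-written sketch using the standard library only. The names `next_code_point` and `decode_utf8` are made up for this illustration and are not part of Boost or ICU. In real code you would more likely use an existing decoder (ICU ships the `U8_NEXT` macro in `<unicode/utf8.h>`, and Boost.Locale has `boost::locale::utf::utf_traits` in `<boost/locale/utf.hpp>`) rather than rolling your own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at s[i] and advances i
// past the bytes it consumed. Returns -1 for malformed input.
std::int32_t next_code_point(const std::string& s, std::size_t& i)
{
    assert(i < s.size());  // precondition: i points at a valid byte
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    int extra;         // number of continuation bytes expected
    std::int32_t cp;   // payload bits collected so far
    std::int32_t min;  // smallest value allowed for this length (rejects overlong forms)
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
    else return -1;  // stray continuation byte or invalid lead byte

    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return -1;  // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return -1;  // not a continuation byte; leave it unconsumed
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    // Reject overlong encodings, values beyond U+10FFFF, and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;
    return cp;
}

// Decodes a whole UTF-8 string into code points (-1 marks each malformed sequence).
std::vector<std::int32_t> decode_utf8(const std::string& s)
{
    std::vector<std::int32_t> out;
    for (std::size_t i = 0; i < s.size(); )
        out.push_back(next_code_point(s, i));
    return out;
}
```

Each non-negative value returned here can be passed as the `UChar32` argument to ICU functions such as `u_isupper`, `u_islower` or `u_isUWhiteSpace` from `<unicode/uchar.h>`; no locale is involved in that step.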

A3. UTF-8 is an encoding, not a locale. There are specific locales, like en_US.UTF-8, zh_CN.UTF-8, and so on. All of these locales use UTF-8 to encode characters. You don't need to know the string's locale; UTF-8 can encode all Unicode characters. A Unicode application can display any Unicode glyph, no matter whether it is Chinese, Japanese, or Thai.
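As an illustration of why no locale is needed for the whitespace part of the question: the Unicode White_Space property is a fixed set of code points defined by the Unicode standard itself, so a check on an already decoded code point can be written without any locale object. The function name `is_unicode_white_space` below is made up for this sketch. ICU exposes the same property as `u_isUWhiteSpace`, and unlike this hard-coded list it stays current with new Unicode versions. Upper/lowercase classification needs much larger tables, so that part is better left to ICU.

```cpp
#include <cstdint>

// Hypothetical helper: true if cp has the Unicode White_Space property.
// The list is the 25 code points given in Unicode's PropList.txt; it does not
// depend on any locale or on the language of the surrounding text.
bool is_unicode_white_space(std::int32_t cp)
{
    return (cp >= 0x0009 && cp <= 0x000D)   // tab, LF, VT, FF, CR
        || cp == 0x0020                     // space
        || cp == 0x0085                     // next line
        || cp == 0x00A0                     // no-break space
        || cp == 0x1680                     // ogham space mark
        || (cp >= 0x2000 && cp <= 0x200A)   // en quad .. hair space
        || cp == 0x2028                     // line separator
        || cp == 0x2029                     // paragraph separator
        || cp == 0x202F                     // narrow no-break space
        || cp == 0x205F                     // medium mathematical space
        || cp == 0x3000;                    // ideographic space (used in CJK text)
}
```

Note that U+3000, the ideographic space common in Chinese and Japanese text, is classified as whitespace here just like U+0020, regardless of which UTF-8 locale the program runs under.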

BTW, Boost.Regex provides a UTF-8 iterator (u8_to_u32_iterator) that yields one code point at a time: http://www.boost.org/doc/libs/1_66_0/libs/regex/doc/html/boost_regex/ref/internals/uni_iter.html

And make sure to ALWAYS use Unicode strings and Unicode APIs, and avoid the Windows MBCS encodings.

Zang MingJie
**A1**: Boost.Locale does [character conversion](http://www.boost.org/doc/libs/1_62_0/libs/locale/doc/html/conversions.html). But you say Boost.Locale cannot do character classification? Is there a way to do it with Boost? Chinese/Japanese do have a space character, for example, don't they? **A2**: I can call ICU directly. But it takes an int32_t. How do I get this from a UTF-8 string via Boost? **A3**: That is exactly why it confuses me that I need to provide a locale at all. But you say it does not matter which locale I provide as long as it is a UTF-8 locale? – sigy Jan 12 '18 at 08:14