
The simple question again: given an std::string, determine which of its characters are digits, symbols, white space, etc., with respect to the user's language and regional settings (locale).

I managed to split the string into a set of characters using the Boost.Locale boundary analysis tool:

#include <boost/locale.hpp>
#include <string>

std::string text = u8"生きるか死ぬか";

boost::locale::boundary::segment_index<std::string::const_iterator> characters(
    boost::locale::boundary::character,
    text.begin(), text.end(),
    boost::locale::generator()("ja_JP.UTF-8"));

for (const auto& ch : characters) {
    // each 'ch' is a segment covering a single character of the Japanese text
}

However, beyond that I don't see any way to determine whether ch is a digit, a symbol, or anything else. There are Boost string classification algorithms, but these don't seem to work with whatever *segment_index::iterator yields.

Nor can I apply std::isalpha(c, std::locale), because I'm unsure whether it is even possible to convert a Boost segment into a char or wchar_t.

Is there any neat way to classify symbols?

Ixanezis
    As usual, to those who downvote: why? What's wrong with the question? Do you definitely know the correct answer? – Ixanezis Jun 30 '14 at 07:53
  • If I'm not wrong, the type of `ch` is `segment`, and a segment is formed by a pair of iterators. So `ch` contains a pair of iterators on `text` delimiting a character. Because the classify functions require just a `char_type` value and you are using multibyte characters, you could convert each segment into a wide-char string of just one character (if not surrogated in the string) and then use the classify function. Does that make sense? – Gonmator Jun 30 '14 at 08:52
  • @Gonmator: If I got you right, you're suggesting converting the `std::string` to an `std::wstring` and using, say, `isdigit(str[0])`, assuming that `str[0]` now stands for a single wide character. If I'm not mistaken, this only increases the chance of the code working correctly, but there are still symbols that cannot be represented with a single `wchar_t`, e.g. in "שָלוֹם". If I rely on that, I can forget about Boost boundary analysis and just always use `str[0]` when `str` is a wide-character string. – Ixanezis Jun 30 '14 at 09:14
  • Yes, I realized strings with surrogated UTF-16 chars won't work. Maybe you can consider another library. In that case ogonek (which internally works with UTF-32) could be useful: https://github.com/rmartinho/ogonek. (I've never tried it.) – Gonmator Jun 30 '14 at 09:31

1 Answer


There are a number of functions and objects in <locale> supporting this, but... the example text you give looks like UTF-8, which is a multibyte encoding, and the classification functions in <locale> don't work with multibyte encodings.

I'd suggest you get the ICU library and use it. Amongst other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros and functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).

James Kanze
  • Thanks. I was thinking that `boost::locale` somehow provides the functionality I'm looking for, because it uses ICU internally by default. Moreover, I'm not sure that extracting a single UTF-32 code point is what I'm looking for, because, as the Boost library docs state, there are symbols composed of several UTF-32 code points, such as שָ – Ixanezis Jul 17 '14 at 08:50