having trouble with Cyrillic characters

Question

I'm trying to use the standard <regex> library to match some Cyrillic words:

  // This is a UTF-8 file.
  std::locale::global(std::locale("en_US.UTF-8"));

  string s {"Каждый охотник желает знать где сидит фазан."};
  regex re {"[А-Яа-яЁё]+"};

  for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    cout << it->str() << "#";
  }

However, that doesn't seem to work. The code above produces the following:

  Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#

rather than the expected:

  Каждый#охотник#желает#знать#где#сидит#фазан

The code of the '�' symbol above is \321.

I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.

Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?

i'm not so sure about it, but shouldn't you use `std::wstring` or `std::u32string` ,`std::wregex` or `boost::u32regex` and so on? — con ko, Mar 03 '20 at 07:58
@JohnDing You were absolutely correct. Using `wstring` et al did the trick. If you don't mind, I'll answer my own question shortly using this bit of knowledge. — undercat, Mar 03 '20 at 08:17

Olaf Dietsche · Accepted Answer · 2020-03-03T18:00:16.970

For ranges like А-Я to work properly, you must use std::regex::collate. From the reference:

Constants
...
collate Character ranges of the form "[a-b]" will be locale sensitive.

Changing the regular expression to

std::regex re{"[А-Яа-яЁё]+", std::regex::collate};

gives the expected result.

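Depending on the encoding of your source file, you might need to prefix the regular expression string with u8 (this compiles only before C++20; from C++20 on, a u8 literal has type const char8_t[], which std::regex does not accept):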

std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
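For completeness, here is a minimal sketch of how the collate flag could be wrapped into a reusable function. The helper name find_words and its parameters are hypothetical, not part of the answer above. The sketch assumes the locale is imbued into the regex before the pattern is compiled, and how well a given standard library collates UTF-8 byte sequences may vary, so treat it as a starting point rather than a guarantee.

```cpp
#include <locale>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): collects every match of
// `pattern` in `text`. The regex is compiled with std::regex::collate, so
// bracket ranges such as [a-b] are compared using the collation rules of `loc`.
std::vector<std::string> find_words(const std::string& text,
                                    const std::string& pattern,
                                    const std::locale& loc) {
    std::regex re;
    re.imbue(loc);  // imbue first: it resets the regex, so compile afterwards
    re.assign(pattern, std::regex::ECMAScript | std::regex::collate);

    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

With the setup from the question, the call would look like find_words(s, "[А-Яа-яЁё]+", std::locale("en_US.UTF-8")). Note that std::locale("en_US.UTF-8") throws std::runtime_error if that locale is not installed on the system.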

edited Mar 03 '20 at 18:00

answered Mar 03 '20 at 10:46

I can confirm this works, and it seems like a less intrusive solution than using wchars. Surprised how few hits `regex::collate` has on Google! – undercat Mar 03 '20 at 11:55

undercat · Answer 2 · 2020-03-03T11:52:56.040

Cyrillic letters are represented as multibyte sequences in UTF-8. Therefore, one way of handling the problem is to use the "wide" version of string, called wstring. Other functions and types that work with characters need to be replaced with their wide-character counterparts as well; generally this is done by prepending w to their name. This works:

std::locale::global(std::locale("en_US.UTF-8"));

wstring s {L"Каждый охотник желает знать где сидит фазан."};
wregex re {L"[А-Яа-яЁё]+"};

for (wsregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
  wcout << it->str() << "#";
}

Output:

Каждый#охотник#желает#знать#где#сидит#фазан#
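The same idea can also be written so that it does not depend on the encoding of the source file, by spelling the character class with universal character names. This is only a sketch: the helper name cyrillic_words is hypothetical, and it assumes wchar_t can hold each of these code points in a single unit (true for the Cyrillic letters used here, which all lie in the Basic Multilingual Plane).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of Cyrillic letters found in `text`.
std::vector<std::wstring> cyrillic_words(const std::wstring& text) {
    // U+0410..U+042F = А..Я, U+0430..U+044F = а..я, U+0401 = Ё, U+0451 = ё
    static const std::wregex re(L"[\u0410-\u042F\u0430-\u044F\u0401\u0451]+");

    std::vector<std::wstring> words;
    for (std::wsregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Printing the results through wcout still needs a suitable locale, as in the snippet above; the matching itself works on wchar_t values directly.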

(Thanks @JohnDing for pitching this solution.)
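To illustrate why the narrow-string version cuts words in the middle, here is a small sketch showing that a single Cyrillic letter occupies two bytes in UTF-8. The names utf8_code_points and yeru are made up for this example. The first byte of "ы" is 0xD1, which is the \321 (octal) mentioned in the question: a byte-oriented matcher sees that byte and the one after it as two separate characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a well-formed UTF-8
// string by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}

// The letter "ы" (U+044B) written as its two UTF-8 bytes, so the example does
// not depend on how the source file itself is encoded.
const std::string yeru = "\xD1\x8B";
```

Here yeru.size() is 2 while utf8_code_points(yeru) is 1, which is the mismatch the wide-character types avoid.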

An alternative solution is to use regex::collate to make regexes locale-sensitive with ordinary strings; see this answer by @OlafDietsche for details. This topic will shed some light on which solution might be preferable in your circumstances. (It turns out that in my case collate was the better idea!)

This post indicates that it might work with utf-8: https://stackoverflow.com/questions/11254232/do-c11-regular-expressions-work-with-utf-8-strings?r=SearchResults — Alan Birtles, Mar 03 '20 at 08:35
