
I'm trying to use the standard <regex> library to match some Cyrillic words:

  // This is a UTF-8 file.
  std::locale::global(std::locale("en_US.UTF-8"));

  string s {"Каждый охотник желает знать где сидит фазан."};
  regex re {"[А-Яа-яЁё]+"};

  for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    cout << it->str() << "#";
  }

However, that doesn't seem to work. The code above results in the following:

  Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#

rather than the expected:

  Каждый#охотник#желает#знать#где#сидит#фазан

The code of the '�' symbol above is \321 (octal, i.e. 0xD1).

I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.

Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?

undercat

2 Answers

4

For ranges like А-Я to work properly, you must use std::regex::collate:

Constants
...
collate Character ranges of the form "[a-b]" will be locale sensitive.

Changing the regular expression to

std::regex re{"[А-Яа-яЁё]+", std::regex::collate};

gives the expected result.


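Depending on the encoding of your source file, you might need to prefix the regular expression string with u8. (This only applies before C++20: since C++20 a u8 literal has type const char8_t[] and can no longer be passed to std::regex.)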

std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
Olaf Dietsche
  • I can confirm this works and seems like a less intrusive solution compared to using wchars, surprised how few hits `regex::collate` has on Google! – undercat Mar 03 '20 at 11:55
1

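Cyrillic letters are represented as multibyte sequences in UTF-8, whereas std::regex works on individual char elements, so a bracket expression sees separate bytes rather than whole letters. Therefore, one way of handling the problem is to use wstring, the "wide" version of string. The other narrow-character types and objects need to be replaced with their wide-character counterparts as well; generally this is done by prepending w to their name (wregex, wsregex_iterator, wcout) and adding the L prefix to string literals. This works: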

std::locale::global(std::locale("en_US.UTF-8"));

wstring s {L"Каждый охотник желает знать где сидит фазан."};
wregex re {L"[А-Яа-яЁё]+"};

for (wsregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
  wcout << it->str() << "#";
}

Output:

Каждый#охотник#желает#знать#где#сидит#фазан#

(Thanks @JohnDing for pitching this solution.)


An alternative solution is to use regex::collate, which makes character ranges locale-sensitive and works with ordinary strings; see this answer by @OlafDietsche for details. This topic sheds some light on which solution might be preferable in your circumstances. (It turned out that in my case collate was the better idea!)
undercat
  • This post indicates that it might work with utf-8: https://stackoverflow.com/questions/11254232/do-c11-regular-expressions-work-with-utf-8-strings?r=SearchResults – Alan Birtles Mar 03 '20 at 08:35