I'm trying to use the standard <regex> library to match some Cyrillic words:
// This is a UTF-8 file.
#include <iostream>
#include <locale>
#include <regex>
#include <string>
using namespace std;
int main() {
    std::locale::global(std::locale("en_US.UTF-8"));
    string s {"Каждый охотник желает знать где сидит фазан."};
    regex re {"[А-Яа-яЁё]+"};
    for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; ++it) {
        cout << it->str() << "#";
    }
}
However, that doesn't seem to work. The code above prints the following:
Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#
rather than the expected:
Каждый#охотник#желает#знать#где#сидит#фазан#
The '�' symbol above is the single byte \321 (octal, i.e. 0xD1).
I've checked the regular expression with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.
Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?