This question is an extension of "Do C++11 regular expressions work with UTF-8 strings?"
#include <iostream>
#include <regex>

if (std::regex_match ("中", std::regex("中") )) // "\u4e2d" also works
    std::cout << "matched\n";
The program is compiled on OS X Mountain Lion with clang++, using the following options:

clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using a similar version [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
I have found that all ranges work if you add a + to the end, as in [一-龠々〆ヵヶ]+, but if you instead add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work.
Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は (\u306f) and ぁ (\u3041), both of which lie below \u4E00.
nhahtdh also suggested regex_search, which works without adding +, but it runs into the same problem as above by matching values outside of the range. I have played with the locales a bit as well. Mark Ransom suggests the engine treats the UTF-8 string as a dumb set of bytes; I think this is probably what it is doing.
Further supporting the theory that the UTF-8 is getting jumbled somehow: both [a-z]{1} and [a-z]+ match a, but only [一-龠々〆ヵヶ]+ matches any of the characters above, not [一-龠々〆ヵヶ]{1}.