In my application I want to let users replace string data using regular expressions. The strings may contain UTF-8 encoded text, so regex_replace should be UTF-8 aware.
While prototyping the regex functionality I got some strange behavior, as demonstrated by the following program:
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
    std::locale::global(std::locale("en_US.UTF-8"));

    std::string origStr = u8"Süße Grüße aus Österreich wünschen FLÜSSE!";
    std::string patternStr = u8"\\w+";
    // The pattern has no capture group, so $1 expands to an empty string.
    std::string replaceStr = u8"_$1_";

    std::cout << "REPLACE '" << patternStr << "' with '" << replaceStr << "' in '" << origStr << "'" << std::endl;

    std::cout << "C++ STL:" << std::endl;
    std::regex pattern(patternStr, std::regex_constants::ECMAScript);
    std::cout << std::regex_replace(origStr, pattern, replaceStr) << std::endl;

    std::cout << "C++ STL with icase:" << std::endl;
    std::regex patternIcase(patternStr, std::regex_constants::ECMAScript | std::regex_constants::icase);
    std::cout << std::regex_replace(origStr, patternIcase, replaceStr) << std::endl;
}
Output:
REPLACE '\w+' with '_$1_' in 'Süße Grüße aus Österreich wünschen FLÜSSE!'
C++ STL:
__ __ __ __ __ __!
C++ STL with icase:
__üß__ __üß__ __ Ö__ __ü__ __Ü__!
The first result line is what's expected. The second line, however, is not. The flag std::regex_constants::icase seems to prevent \w from matching UTF-8 characters. Why is this happening?
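One thing that may help narrow this down: std::regex is std::basic_regex<char>, so it works on individual char values (bytes), not on decoded UTF-8 code points. The following is only a minimal sketch to illustrate that; the helper matches_fully and the constant u_umlaut are names made up for this example. It does not by itself explain the icase difference, but it suggests that whether \w accepts a non-ASCII character comes down to how the library classifies each byte on its own.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
// std::regex operates on char, so "one character" means one byte here.
bool matches_fully(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// "ü" (U+00FC) spelled out as its two UTF-8 bytes, so the example does not
// depend on the encoding of the source file.
const std::string u_umlaut = "\xC3\xBC";
```

With this, matches_fully(u_umlaut, ".") is false while matches_fully(u_umlaut, "..") is true: the single visible character is seen by the regex engine as two separate units.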
I also tried using boost::regex, where \w doesn't match a UTF-8 character, no matter if case is ignored or not.
Command to build and run:
g++ -std=c++14 -Wall -pedantic simple-replace.cpp -o simple-replace && ./simple-replace
I'm using macOS High Sierra; g++ -v says: Apple LLVM version 10.0.0 (clang-1000.11.45.5).
Update:
As to why this should work at all, this answer says:
You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale.
Source: https://stackoverflow.com/a/11255698/2771733
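Since the quoted answer makes everything depend on the system having a UTF-8 locale, it may be worth checking that first. A minimal sketch, assuming that constructing a std::locale from an unknown name throws std::runtime_error (which is what the standard specifies); the function name locale_available is made up for this example:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if a locale with the given name can be constructed.
// std::locale's constructor throws std::runtime_error for names the system
// does not know.
bool locale_available(const std::string& name) {
    try {
        std::locale loc(name);
        (void)loc;  // only the successful construction matters
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling locale_available("en_US.UTF-8") before std::locale::global would show whether the locale used in the program above actually exists on the machine. Note that this only tells whether the locale can be loaded, not whether the regex implementation makes use of it for multi-byte characters.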
The test string in origStr has no real meaning and translates to something like "Sweet greetings from Austria wish RIVERS!". It's used only because of the different non-ASCII UTF-8 characters in it.