In my application, I want to let users replace string data using regular expressions. The strings may contain UTF-8 encoded text, so regex_replace needs to be UTF-8 aware.

While prototyping the regex functionality, I ran into some strange behavior, as demonstrated by the following program:

#include <iostream>
#include <locale>   // std::locale
#include <string>
#include <regex>
int main() {
    std::locale::global(std::locale("en_US.UTF-8"));
    std::string origStr = u8"Süße Grüße aus Österreich wünschen FLÜSSE!";
    std::string patternStr = u8"\\w+";
    std::string replaceStr = u8"_$1_";
    std::cout << "REPLACE '" << patternStr << "' with '" << replaceStr << "' in '" << origStr << "'" << std::endl;

    std::cout << "C++ STL:" << std::endl;
    std::regex pattern(patternStr, std::regex_constants::ECMAScript);
    std::cout << std::regex_replace(origStr, pattern, replaceStr) << std::endl;

    std::cout << "C++ STL with icase:" << std::endl;
    std::regex patternIcase(patternStr, std::regex_constants::ECMAScript | std::regex_constants::icase);
    std::cout << std::regex_replace(origStr, patternIcase, replaceStr) << std::endl;
}

Output:

REPLACE '\w+' with '_$1_' in 'Süße Grüße aus Österreich wünschen FLÜSSE!'
C++ STL:
__ __ __ __ __ __!
C++ STL with icase:
__üß__ __üß__ __ Ö__ __ü__ __Ü__!

The first result line is what I expected. The second is not: with std::regex_constants::icase set, \w no longer seems to match the non-ASCII (multi-byte UTF-8) characters. Why is this happening?
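
A side note on reading the output above: the pattern \w+ has no capture group, so $1 expands to an empty string and every match collapses to __. The following is a minimal diagnostic sketch, assuming the intent is to see exactly which bytes each match covers. It uses $&, which inserts the whole match in the default ECMAScript replacement format. The helper name wrapMatches is hypothetical and not part of the original program.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps every match of `patternStr` in underscores.
// "$&" expands to the whole match, whereas "$1" expands to capture
// group 1, which is empty when the pattern contains no groups.
std::string wrapMatches(const std::string& text,
                        const std::string& patternStr,
                        std::regex_constants::syntax_option_type flags =
                            std::regex_constants::ECMAScript) {
    const std::regex pattern(patternStr, flags);
    return std::regex_replace(text, pattern, "_$&_");
}
```

Calling wrapMatches(origStr, patternStr) and then the same call with ECMAScript | icase as the third argument should make the difference between the two modes visible byte by byte, instead of inferring it from the positions of the empty replacements.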

I also tried boost::regex. There, \w doesn't match non-ASCII UTF-8 characters at all, regardless of whether case is ignored.

Command to build and run:

g++ -std=c++14 -Wall -pedantic simple-replace.cpp -o simple-replace && ./simple-replace

I'm using macOS High Sierra; g++ -v reports: Apple LLVM version 10.0.0 (clang-1000.11.45.5).

Update:

As to why this should work at all, this answer says:

You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale.

Source: https://stackoverflow.com/a/11255698/2771733
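
To check the precondition from that quote on a given system, something like the following sketch might help. It probes whether a named locale can be constructed at all, and asks a locale how it classifies one individual byte, which is the granularity at which a char-based std::regex consults the locale. The helper names localeAvailable and byteIsAlnum are hypothetical; the results for "en_US.UTF-8" depend on the platform and are not asserted here.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the C++ runtime can construct the
// named locale. std::locale's constructor throws std::runtime_error when
// the name is not recognized.
bool localeAvailable(const std::string& name) {
    try {
        std::locale probe(name);
        (void)probe;  // only the successful construction matters
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Hypothetical helper: returns true if `loc` classifies the single byte
// `byte` as alphanumeric. A UTF-8 character such as "ü" consists of two
// such bytes, and each one is classified on its own.
bool byteIsAlnum(char byte, const std::locale& loc) {
    return std::isalnum(byte, loc);
}
```

For example, localeAvailable("en_US.UTF-8") tells whether line 1 of main() in the program above can succeed, and byteIsAlnum('\xC3', std::locale("en_US.UTF-8")) shows how that locale treats the lead byte of "ü" in isolation.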

The test string in origStr has no real meaning; it translates to something like "Sweet greetings from Austria wish RIVERS!". It's only used because of the variety of non-ASCII characters it contains.

z80crew
  • `\w` is already case-insensitive, why do you need to add it? – ctwheels Dec 10 '19 at 17:40
  • `std::regex` isn't designed to work with `UTF-8` or any multibyte encoding, you need to convert to wide strings and use `std::wregex`. – Galik Dec 10 '19 at 17:40
  • You should use [CTRE](https://github.com/hanickadot/compile-time-regular-expressions) which has a branch for UTF-8 support – Guillaume Racicot Dec 10 '19 at 17:41
  • Possible duplicate of https://stackoverflow.com/questions/37989081/how-to-use-unicode-range-in-c-regex – Galik Dec 10 '19 at 17:42
  • Or use PCRE2 or another regular expression library that explicitly supports UTF-8 encoded text. – Shawn Dec 10 '19 at 19:15
  • @Galik I've added a link to an answer that states that, in principle, it should work, as regex should be locale-aware. Additionally, my code shows that it does work as long as the `icase` flag isn't set. – z80crew Dec 11 '19 at 11:56
  • @Galik As for the possible duplicate: that question isn't about using character classes or using the `icase` flag. And it doesn't address the usage of locales in combination with std::regex. – z80crew Dec 11 '19 at 12:04
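
The suggestion in the comments to convert to wide strings and use `std::wregex` could be sketched roughly as follows. This is an illustration under stated assumptions, not a verified fix for the `icase` behavior: the decoder `utf8ToWide` is a hypothetical hand-written helper, it assumes a 32-bit `wchar_t` (as on macOS and Linux) or input limited to the Basic Multilingual Plane, and it does not reject overlong encodings. Whether `\w` then matches "ü" still depends on the locale in effect when the `std::wregex` is constructed, so the example only demonstrates the difference in matching units using `.`.

```cpp
#include <cstddef>
#include <cstdint>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into one wchar_t per code point.
// Assumes wchar_t is wide enough for every decoded code point.
std::wstring utf8ToWide(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// True if the whole input is exactly one matching unit for `.`:
// one byte for std::regex, one wchar_t for std::wregex.
bool isSingleUnitNarrow(const std::string& s) {
    return std::regex_match(s, std::regex("."));
}
bool isSingleUnitWide(const std::wstring& s) {
    return std::regex_match(s, std::wregex(L"."));
}
```

With these helpers, `isSingleUnitNarrow(u8"ü")` is false because `std::regex` sees two bytes, while `isSingleUnitWide(utf8ToWide(u8"ü"))` is true because the decoded string holds a single `wchar_t`. The replacement result would have to be encoded back to UTF-8 before printing, which this sketch leaves out.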
