Do C++11 regular expressions work with UTF-8 strings?

Question

If I want to use C++11's regular expressions with Unicode strings, will they work with char* as UTF-8, or do I have to convert them to a wchar_t* string?

Do I detect a confusion about Unicode/code points and the encoding schemes of Unicode here? — Maarten Bodewes, Jun 28 '12 at 23:50

Jeffery Thomas · Accepted Answer · 2017-02-15T04:34:27.553

18

You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale. The following test returned true for me on Clang/OS X.

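#include <locale>
#include <regex>
#include <string>

bool test_unicode()
{
    // Remember the current global locale so it can be restored afterwards.
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    // The regex picks up the global locale when it is constructed.
    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    bool result = std::regex_match(std::string("abcdéfg"), pattern);

    std::locale::global(old);

    return result;
}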

NOTE: This was compiled in a file that was UTF-8 encoded.

Just to be safe, I also used a string with the bytes written as explicit hex escapes. It also worked.

bool test_unicode2()
{
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    // The literal is split so that the 'f' is not read as part of the \xA9 escape.
    bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);

    std::locale::global(old);

    return result;
}
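
Both tests assume the "en_US.UTF-8" locale exists on the machine. If it does not, the std::locale constructor throws std::runtime_error, so it may be worth probing first. A minimal sketch; locale_available is a hypothetical helper name, not something from the standard library:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (assumption, not part of the tests above):
// returns true when the named locale can be constructed on this system.
bool locale_available(const std::string& name)
{
    try {
        std::locale probe(name);  // throws std::runtime_error for an unknown name
        (void)probe;              // only the construction matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling locale_available("en_US.UTF-8") before running the tests would distinguish "the locale is missing" from "the regex did not match".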

Update: test_unicode() still works for me.

$ file regex-test.cpp 
regex-test.cpp: UTF-8 Unicode c program text

$ g++ --version
Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

edited Feb 15 '17 at 04:34

answered Jun 29 '12 at 03:32

Jeffery Thomas

You don't need to save the source code in UTF-8 if you use `u8"abcdéfg"`. – R. Martinho Fernandes Jun 29 '12 at 07:51
Is locale so important? If you ignore locale at all? – Viet Nov 10 '12 at 11:32
@Viet There is always a locale. If you don't explicitly set the locale you need, then regex will process with the preexisting locale. I would not expect the regex to work with UTF-8 strings if the locale is not compatible with UTF-8. – Jeffery Thomas Nov 11 '12 at 18:14
@Jeffery Thomas: Thanks. I googled a bit and found that this is applicable to Windows as well. – Viet Nov 12 '12 at 01:54
`"abcd\0xC3\0xA9fg"` is a string with two embedded null bytes. What you want is probably `"abcd\xC3\xA9""fg"`. Now, I tried this with clang on my Linux box and it quite clearly doesn't work :( https://gist.github.com/rmartinho/5349044 – R. Martinho Fernandes Apr 09 '13 at 20:38
And then I did some tests on a MacOS box and learned that while `[[:alpha:]]` can deal with multibyte characters fine, something as basic as `.` cannot: the regex `".."` matches the string `u8"é"` (or `"\xC3\xA9"`), which is just unacceptable. – R. Martinho Fernandes Apr 09 '13 at 20:58
`std::regex_match(u8"abcdéfg", std::regex("[[:alpha:]]+"))` fails for me (g++ 5.4.0 on Ubuntu). But `std::regex_match(L"abcdéfg", std::wregex(L"[[:alpha:]]+"))` works. (utf-8 locale is enabled in both cases) – jfs Feb 15 '17 at 03:39
@J.F.Sebastian I posted my stats. Ensure that the C++ source file is UTF-8 encoded. – Jeffery Thomas Feb 15 '17 at 04:35
@JefferyThomas: yes, I'm sure that the source code is utf-8 (though it is not necessary with `u8""`). Both `test_unicode()` and `test_unicode2()` return `false` (`g++ -std=c++11 *.cc && ./a.out`). Whatever ideone uses produces [the same result](http://ideone.com/5tJBDv). – jfs Feb 15 '17 at 05:34
GNU C++ libc++ regex library would not work for Japanese (or other multi-byte) characters. For that, you would have to use ICU library. – John Greene Aug 30 '17 at 18:49
@EgbertS The code I presented is for UTF-8 (which is a multi-byte encoding). If the Japanese text is encoded in a UTF-8 string, the code will work. If you are using another encoding (like Shift-JIS) you would need to convert it to UTF-8. – Jeffery Thomas Aug 30 '17 at 22:02

score 2 · Answer 2 · answered Jun 29 '12 at 01:10

C++11 regular expressions will "work with" UTF-8 just fine, for a minimal definition of "work". If you want "complete" Unicode regular expression support for UTF-8 strings, you will be better off with a library that supports that directly such as http://www.pcre.org/ .
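
To make the "minimal definition" concrete: std::regex over char works on individual char units, so a wildcard sees the bytes of a UTF-8 sequence rather than the character. A small sketch of that behaviour; matches_n_dots is a hypothetical helper name:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (assumption): true when the whole string is matched
// by exactly n `.` wildcards, i.e. when std::regex sees n units in it.
bool matches_n_dots(const std::string& text, std::size_t n)
{
    std::regex pattern(std::string(n, '.'));  // e.g. n == 2 gives ".."
    return std::regex_match(text, pattern);
}
```

With libstdc++ and libc++ I would expect matches_n_dots("\xC3\xA9", 2) to succeed and matches_n_dots("\xC3\xA9", 1) to fail, even though the string holds the single character é.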

wjl

@ildjarn: ...which needs [ICU](http://site.icu-project.org/) support compiled in, which unfortunately is not the rule on all platforms, and can be quite a b**** to get to work. ICU, however, has RegEx support of its own... – DevSolar Apr 09 '13 at 11:15

score -1 · Answer 3 · answered Jun 29 '12 at 20:46

Yes they will, this is by design of the UTF-8 encoding. Substring operations should work correctly if the string is treated as an array of bytes rather than an array of codepoints.

See FAQ #18 here: http://www.utf8everywhere.org/#faq.validation about how this is achieved in this encoding's design.
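
A rough illustration of the byte-array view, assuming well-formed UTF-8 input: std::string::find locates a UTF-8 needle correctly, but the offset it returns is in bytes, not code points. code_points_before is a hypothetical helper that converts one into the other:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (assumption): counts the code points that start
// before byte_pos by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t code_points_before(const std::string& utf8, std::size_t byte_pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos && i < utf8.size(); ++i) {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For the string "abcd\xC3\xA9""fg", find("fg") reports byte offset 6, while code_points_before gives 5 for that position, because é occupies two bytes.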

Pavel Radzivilovsky

Regex matching is not a "substring operation". – R. Martinho Fernandes Apr 09 '13 at 20:04

kayleeFrye_onDeck · Answer 4 · 2018-06-08T03:55:24.947

I have a use case in a parsing module where I look for Cartesian coordinates in strings that may contain Unicode characters. This sample shows how I handle it with std::wregex and std::wstring, as advised.

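#include <regex>
#include <string>

// Returns true if the token is an optionally negative run of digits.
static bool isCoordinate(const std::wstring& token)
{
    std::wregex re(L"^(-?[[:digit:]]+)$");
    std::wsmatch match;
    return std::regex_search(token, match, re);
}

// wmain is the Microsoft-specific wide-character entry point.
int wmain(int argc, wchar_t * argv[])
{
    // Test against input that is neither a number nor plain ASCII text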
    bool coord = ::isCoordinate(L"أَبْجَدِيَّة عَرَبِيَّة‎中文"); 

    if (!coord)
        return 0;
    return 1;
}
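
The sample above starts from wide strings. If the input arrives as UTF-8 bytes instead, one option is to decode it first and then use std::wregex, so that each code point is a single unit. A minimal sketch under stated assumptions: utf8_to_wide and is_single_character are hypothetical helpers, the input is taken to be well-formed UTF-8, and wchar_t is assumed wide enough for the code points involved (on Windows it is 16 bits, so this only covers the Basic Multilingual Plane there):

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (assumption): decodes well-formed UTF-8 into one
// wchar_t per code point. It does not validate the input.
std::wstring utf8_to_wide(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < utf8.size(); ++k, ++i) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}

// True when the decoded text is exactly one unit for std::wregex.
bool is_single_character(const std::string& utf8)
{
    static const std::wregex one(L".");
    return std::regex_match(utf8_to_wide(utf8), one);
}
```

Character classes such as [[:alpha:]] still depend on the locale the regex was constructed with, so this sketch only addresses the unit-of-matching problem, not classification.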
