
I have to use a Unicode range in a regex in C++. Basically, what I need is a regex that accepts all valid Unicode characters. I tried the test expression below and am facing some issues with it.


std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");

Is the issue with \\u?

vijin
  • Remove `\\u0080-\\uDB7F` and try to match `124`. If it matches, yes, the problem is with `\\u0080-\\uDB7F`. – Wiktor Stribiżew Jun 23 '16 at 10:32
  • The problem is C++ having no usable Unicode support. Use something like ICU. – Baum mit Augen Jun 23 '16 at 10:34
  • Or Boost is also a good alternative. BTW, [check this](http://en.cppreference.com/w/cpp/regex/ecmascript): *UnicodeEscapeSequence* is the letter `u` followed by exactly four *HexDigits*. This character escape matches the character whose code unit equals the numeric value of this four-digit hexadecimal number. If the value does not fit in this `std::basic_regex`'s *CharT*, `std::regex_error` is thrown(C++ only). – Wiktor Stribiżew Jun 23 '16 at 10:35
  • @WiktorStribiżew uDB7F and most stuff before that definitely does not fit into a `char`. – Baum mit Augen Jun 23 '16 at 10:43
  • @BaummitAugen: That is why perhaps `wregex` could help. I have no time to check that now – Wiktor Stribiżew Jun 23 '16 at 10:44
  • Basically, what I need is a regex that accepts all valid Unicode characters. The expression provided in the question was just a test regex. I'll modify the question accordingly. – vijin Jun 23 '16 at 10:49

1 Answer


This should work, but you need to use std::wregex and std::wsmatch instead of std::regex and std::smatch. Both the source string and the regular expression must first be converted to wide-character Unicode (UTF-32 on Linux, UTF-16(ish) on Windows).
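
As a minimal sketch of the wide-character approach on its own, the helper below skips the UTF-8 conversion and takes a wide string directly. The function name `first_non_ascii_run` is hypothetical, and the range is simply the one copied from the question. It assumes the standard library's ECMAScript grammar accepts `\u` escapes for `wchar_t`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of characters in the range
// U+0080..U+DB7F (the range from the question), or an empty string if the
// input contains no such character.
inline std::wstring first_non_ascii_run(const std::wstring& text)
{
    // Wide regex: each \uXXXX escape is compared against a wchar_t code unit.
    static const std::wregex re(L"[\\u0080-\\uDB7F]+");
    std::wsmatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::wstring();
}
```

Called with `L"john.doe@\u795E\u8C15.com"`, this should return the two characters 神谕. Note that with a 16-bit `wchar_t` the regex sees UTF-16 code units, so characters outside the Basic Multilingual Plane appear as surrogate pairs rather than single characters.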

This works for me where source text is UTF-8:

#include <iostream>
#include <regex>
#include <string>

inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}

inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}

int main()
{
    std::string test = "john.doe@神谕.com"; // utf8
    std::string expr = "[\\u0080-\\uDB7F]+"; // utf8

    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);

    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}

Output:

神谕

Note: If you need a UTF conversion library, I used THIS ONE in the example above.
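
If pulling in a library is not an option, the two placeholder functions could be filled in by hand. The following is only a sketch, not the library's implementation: it assumes a 32-bit `wchar_t` (as on Linux), stores one code point per `wchar_t`, and does only basic validation (it does not reject overlong encodings or surrogate code points).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch: decode UTF-8 into one wchar_t per code point (assumes 32-bit wchar_t).
inline std::wstring from_utf8(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Sketch: encode each wchar_t (treated as a code point) as UTF-8.
inline std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (wchar_t wc : ws)
    {
        const std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

On platforms where `wchar_t` is 16 bits (Windows), code points above U+FFFF would need to be split into surrogate pairs, which this sketch does not do.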

Edit: Or, you could use the functions given in this answer:

Any good solutions for C++ string code point and code unit?

Galik
  • Great answer, thanks! What does the `[\\u0080-\\uDB7F]+` range cover? `A-Z`? In that vein, what would be a regex for `[a-zA-Z0-9]`? – SexyBeast Jan 16 '18 at 23:40
  • @SexyBeast I just copied that range out of the OP's question. But you can see what it covers here: http://www.idevelopment.info/data/Programming/character_encodings/PROGRAMMING_character_encodings.shtml Also, what you have written should work fine in a regex. – Galik Jan 17 '18 at 01:27