
I want to accept, for example, "ram", "rém", "rèm" and "ràm" as valid input, so I do this:

std::string ss = "rém";
bool valid = std::regex_match(ss, std::regex("r[aéèà]m"));

But in this case 'valid' is false. Is there something special about the characters é, è and à? Should I modify the regular expression? Thanks.

Jaziri Rami
  • Likely a bug in the implementation. Can you try the same on boost regex? – Tanveer Badar Apr 20 '20 at 15:16
  • What is the encoding used? `std::string` doesn't support UTF... Prefer `wstring`. – Jean-Baptiste Yunès Apr 20 '20 at 15:20
  • I get `true` after running this code in VS2017. – Wiktor Stribiżew Apr 20 '20 at 15:20
  • This might be a dupe but I'm hesitant to hammer it: https://stackoverflow.com/q/23932970/10077 – Fred Larson Apr 20 '20 at 15:21
  • @FredLarson I think it is the problem... I have another link on the same problem. You can close. – Jean-Baptiste Yunès Apr 20 '20 at 15:21
  • Try declaring `std::wstring ss = L"rém"` and then use `std::wcout << std::regex_match(ss, std::wregex(L"r[aéèà]m"));` – Wiktor Stribiżew Apr 20 '20 at 15:24
  • @WiktorStribiżew you are right, I converted to std::wstring and it works. In fact, in my project both the string and the regex are UTF-8 std::string inputs, so é, for example, is encoded as a two-byte sequence, and that's why we need to convert to UTF-16 before calling std::regex_match – Jaziri Rami Apr 21 '20 at 14:04
  • Please let me know if the solutions in the duplicate links work for you, or if your question should be reopened. – Wiktor Stribiżew Apr 21 '20 at 14:06
  • @WiktorStribiżew I think it's not exactly the same. The first link uses boost::regex, and the second is about using Unicode escapes (\\u0080, for example) in the regex instead of the literal Latin characters, which is not easy to read and understand without consulting the docs – Jaziri Rami Apr 21 '20 at 14:15
  • @WiktorStribiżew but I think there is another issue. With the same regex "r[aéèà]m", if the input is "rmm", regex_match returns false, which is OK, but if I pass "rMm" I get true, which is not OK. Is there something to add for uppercase characters? – Jaziri Rami Apr 21 '20 at 14:27
  • With `L"rMm"` as input, I get `0` as output, so no match. – Wiktor Stribiżew Apr 21 '20 at 14:30
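
The conversion mentioned in the comments (UTF-8 `std::string` to a wide string before matching) could look roughly like the sketch below. The helper name `utf8_to_wide` is hypothetical and not from the thread. It assumes the input only contains code points up to U+FFFF (1 to 3 byte sequences), so each code point fits in one `wchar_t` on both Windows and Linux, and it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): decode UTF-8 bytes into a
// std::wstring holding one wchar_t per code point.
// Assumption: only 1- to 3-byte sequences (code points up to U+FFFF) occur;
// 4-byte sequences and stray continuation bytes are reported as errors.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80) {                    // 0xxxxxxx: plain ASCII
            len = 1; cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {   // 110xxxxx: 2-byte sequence
            len = 2; cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {   // 1110xxxx: 3-byte sequence
            len = 3; cp = lead & 0x0F;
        } else {
            throw std::invalid_argument("unsupported or invalid UTF-8 lead byte");
        }
        if (i + len > utf8.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Fold in the continuation bytes (10xxxxxx), 6 payload bits each.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}
```

With this, the 4-byte UTF-8 string `"r\xC3\xA9m"` becomes a 3-character wide string, so a bracket expression in a `std::wregex` sees é as a single character rather than two separate bytes.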

1 Answer

You can use std::wstring to hold the string and std::wregex to run the regex on Unicode strings:

std::wstring ss = L"rém";
std::wcout << std::regex_match(ss, std::wregex(L"r[aéèà]m"));
// => 1, there is a match
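
For reference, here is a slightly fuller sketch of the same idea wrapped in a function. The name `is_valid_word` is hypothetical, and the accented letters are written as universal character names (`\u00E9` for é, `\u00E8` for è, `\u00E0` for à) on the assumption that this keeps the result independent of the source file's encoding.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original answer): returns true when
// the whole input is "r", then one of a / é / è / à, then "m".
bool is_valid_word(const std::wstring& input) {
    // \u00E9 = é, \u00E8 = è, \u00E0 = à; each is a single wchar_t, so the
    // bracket expression treats it as one character.
    static const std::wregex pattern(L"r[a\u00E9\u00E8\u00E0]m");
    return std::regex_match(input, pattern);
}
```

Calling `is_valid_word(L"r\u00E9m")` should match, while `L"rmm"` and `L"rMm"` should not, since the pattern is case-sensitive and `m`/`M` are not in the bracket expression.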
Wiktor Stribiżew