I am having trouble validating international email addresses such as john.doe@神谕.com, sara.smith@神谕.com, babu.ratnakar+आଆఉఊګ神谕@gmail.com, and testæœö.神谕#$&*éùôß@äßæçëêùé+आଆ神谕.com using regex in C++.

The following Regex worked fine for me in Java:

^[\\p{L}0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[\\p{L}0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[\\p{L}0-9](?:[\\p{L}0-9-]*[\\p{L}0-9])?\\.)+[\\p{L}0-9](?:[\\p{L}0-9-]*[\\p{L}0-9])?$

I tried the same pattern with a slight modification in C++:

std::string str("[\\\\p{L}0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[\\\\p{L}0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[\\\\p{L}0-9](?:[\\\\p{L}0-9-]*[\\\\p{L}0-9])?\.)+[\\\\p{L}0-9](?:[\\\\p{L}0-9-]*[\\\\p{L}0-9])?"); 

std::regex rx4(str);

But regex_match fails in all cases. I think the issue is with \p{L}: when I replaced it with a-z, the regex accepted email addresses made of English letters. That is, this one works:

std::regex rx3("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", std::regex::ECMAScript);

Does \p{L}, which matches Unicode letters, not work in C++?

Timothy
vijin
  • 2
    Did you obey the correct escaping (`\\`) when you build the regex string? – πάντα ῥεῖ Jun 22 '16 at 07:49
  • 3
    Short answer: [don't](https://davidcel.is/posts/stop-validating-email-addresses-with-regex/). Or at least, don't try too hard. You'll need to send them a verification email anyway, so just use a simple regex and then try to email them. – BoBTFish Jun 22 '16 at 07:55
  • Yeah, I believe the escaping was correct. The only additional thing I had to add was two more backslashes '\' before '\\p{L}' – vijin Jun 22 '16 at 08:14
  • 1
    As pointed out: escaping. Java and C++ escape the same way, so \\ should stay as it is, **not** \\\\. – SamWhan Jun 22 '16 at 08:15
  • When I kept it as just two backslashes, i.e. \\p{L}, I get the following error: Microsoft C++ exception: std::tr1::regex_error at memory location 0x047fcbfc – vijin Jun 22 '16 at 08:22
  • 1
    I wonder if `"^(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})$"` works for you. It is posted [on SO here](http://stackoverflow.com/a/46181/3832970). – Wiktor Stribiżew Jun 22 '16 at 08:45

1 Answer

C++ std::regex supports 6 regex flavors:

Six different regular expression flavors or grammars are defined in std::regex_constants:

ECMAScript: Similar to JavaScript
basic: Similar to POSIX BRE.
extended: Similar to POSIX ERE.
grep: Same as basic, with the addition of treating line feeds as alternation operators.
egrep: Same as extended, with the addition of treating line feeds as alternation operators.
awk: Same as extended, with the addition of supporting common escapes for non-printable characters.

None of these support Unicode properties (or Unicode category classes) like \p{L}, thus you cannot use \p{L} in your patterns.
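To see this on a given toolchain, a small probe like the sketch below can help. The helper name `compiles_and_matches` is hypothetical, not part of any library. Behaviour differs between standard library implementations: some throw `std::regex_error` for `\p{L}` (as the MSVC error in the comments suggests), while others may silently treat the escape as literal characters. In both cases the pattern does not match a Unicode letter, which is all the probe checks.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true only if `pattern` compiles AND fully
// matches `text`. A std::regex_error thrown while compiling or matching
// (for example for an unsupported escape) is reported as "no match".
bool compiles_and_matches(const std::string& pattern, const std::string& text) {
    try {
        const std::regex rx(pattern);  // ECMAScript grammar is the default
        return std::regex_match(text, rx);
    } catch (const std::regex_error&) {
        return false;
    }
}
```

For example, `compiles_and_matches("[\\p{L}]+", "é")` is expected to return false, whereas `compiles_and_matches("[a-z]+", "abc")` returns true.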

Use your workaround if it works for you:

std::regex rx3("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", std::regex::ECMAScript);

Or use a version from the well-known SO post "Validate email address in JavaScript?" (with the anchors removed since you are using regex_match, the pattern re-escaped for a non-raw string literal, and std::regex::ECMAScript omitted since it is the default):

std::regex rx3("(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})");
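As a rough sketch of how this pattern behaves on non-ASCII input: it uses only negated character classes, so when the address is held as UTF-8 bytes in a `std::string`, every byte that is not explicitly excluded is accepted and no Unicode property support is needed. The function name `looks_like_email` is hypothetical, and the sketch assumes the source file and the input are UTF-8 encoded and that the default "C" locale is in effect.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper around the permissive pattern shown above.
// The input is treated as a sequence of UTF-8 bytes; this checks only the
// rough shape "local@label.label", not real-world deliverability.
bool looks_like_email(const std::string& utf8_text) {
    static const std::regex rx(
        "(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")"
        "@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})");
    // regex_match requires the whole string to match, so no ^ or $ is needed.
    return std::regex_match(utf8_text, rx);
}
```

Calling `looks_like_email("john.doe@example.com")` should return true, while a string without an `@` should return false. Because the match is byte-based, the `{2,}` quantifier counts bytes rather than characters, so a single multi-byte character already satisfies it.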
Community
Wiktor Stribiżew
  • I tried your suggestion, but I am still seeing a difference. I used the regex in both Java and C++, and the address 伊昭傑@郵件.商務 is accepted in Java but rejected in C++. In Java the regex is "^(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})$". In C++ I use std::regex rx3("(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})"). – vijin Jun 22 '16 at 13:05
  • Use [`^\S+@[^\s@]+\.[^\s@.]+$`](https://regex101.com/r/jB9eG8/1), why restrict the email so much? – Wiktor Stribiżew Jun 22 '16 at 13:10
  • Yeah. I understand. But trying to understand the diff here. Behaviour should be the same in java and cpp, right? – vijin Jun 22 '16 at 13:15
  • Is there any equivalent of \p{L} in C++ ? – vijin Jun 22 '16 at 13:34
  • 1
    Use Boost regex library, or PCRE, PCRE2, they have Unicode category support. BTW, `std::regex reg(R"((?:(?:[^<>()\[\].,;:\s@\"]+(?:\.[^<>()\[\].,;:\s@\"]+)*)|\".+\")@(?:(?:[^<>()\[\].,;:\s@\"]+\.)+[^<>()\[\].,;:\s@\"]{2,}))");` matches your email. See https://ideone.com/d26xH2 – Wiktor Stribiżew Jun 22 '16 at 13:36
  • Thanks for your help. Sorry to keep asking silly questions; I am a novice at this. I am getting a compilation error: R : undeclared identifier. – vijin Jun 22 '16 at 13:56
  • This is a raw string literal, which your compiler might not support since you are using TR1. You should really be using `std::regex`, not TR1 regex, unless you are using Visual Studio 2008. `R"((?:(?:[^<>()\[\].,;:\s@\"]+(?:\.[^<>()\[\].,;:\s@\"]+)*)|\".+\")@(?:(?:[^<>()\[\].,;:\s@\"]+\.)+[^<>()\[\].,;:\s@\"]{2,}))"` = `"(?:(?:[^<>()\\[\\].,;:\\s@\"]+(?:\\.[^<>()\\[\\].,;:\\s@\"]+)*)|\".+\")@(?:(?:[^<>()\\[\\].,;:\\s@\"]+\\.)+[^<>()\\[\\].,;:\\s@\"]{2,})"`. – Wiktor Stribiżew Jun 22 '16 at 14:00
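The relationship between the two spellings can be illustrated with a shorter pattern. The sketch below assumes a C++11 (or later) compiler, since raw string literals are not available before that, and borrows the simple pattern suggested earlier in these comments (without anchors); the variable names are made up for the example.

```cpp
#include <regex>
#include <string>

// The same pattern written twice. In an ordinary string literal every
// backslash meant for the regex engine has to be doubled...
const std::string escaped_pattern = "\\S+@[^\\s@]+\\.[^\\s@.]+";

// ...while a raw string literal R"( ... )" passes backslashes through as-is.
const std::string raw_pattern = R"(\S+@[^\s@]+\.[^\s@.]+)";

// Both strings hold identical characters, so either can build the regex.
const std::regex simple_email_rx(raw_pattern);
```

On a compiler without raw string literal support, the `R` prefix is parsed as an ordinary identifier, which would explain the "undeclared identifier" error; the doubled-backslash form works everywhere.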