
I need to validate various formats of international email addresses in C++. Many of the answers I found online don't cut it, so I'm sharing a solution that works well for me, for anyone who is using the ATL Server Library.

Some background: I started with the post "Using a regular expression to validate an email address", which pointed to http://emailregex.com/. That site lists a regular expression, in various languages, that supports the RFC 5322 Official Standard for the Internet Message Format.

The regular expression provided is

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

I'm using C++ with the ATL Server Library, which used to be part of Visual Studio; Microsoft has since released it as open source on CodePlex. We still use it for some of its template libraries. My goal is to modify this regular expression so that it works with CAtlRegExp.

tdemay
  • Consider if you should: http://www.regular-expressions.info/email.html -- what are you using the email addresses for? What formats does the code that consumes it (displays it, emails, etc) require? Because a valid email address your engine cannot consume is a false positive. And a well formed email address that does not exist is another false positive. – Yakk - Adam Nevraumont May 05 '17 at 18:39
  • I think you are suggesting that we should reconsider validating the input at all. That's an excellent point. As I just confirmed Microsoft Outlook doesn't even check an invalid email address. It gladly accepted "j.@server1.proseware.com", although I quickly got an email reply that said the email address was malformed. I like your thought. In our product someone is setting up rules that licenses content to users based on their Active Directory email address. Our customers insist that we validate the rule at creation time so users using the rule later will not receive errors. – tdemay May 05 '17 at 20:11
  • I do disagree with the author's suggestion that the regex he provides covers 99% of the email addresses in use today. Maybe in the United States that's true. But it's definitely not the case globally as I'm getting my &^#@! handed to me in China at the moment. Which is what prompted this post. – tdemay May 05 '17 at 20:22

1 Answer


The regular expression engine (CAtlRegExp) in ATL is pretty basic. I was able to modify the regular expression as follows:

^{([a-z0-9!#$%&'*+/=?^_`{|}~\-]+(\.([a-z0-9!#$%&'*+/=?^_`{|}~\-]+))*)@((([a-z0-9]([a-z0-9\-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9\-]*[a-z0-9])?)|(\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\]))}$
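One visible change in the expression above is that each octet of the bracketed IP address is spelled out as (2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]) and the group is written four times, presumably because CAtlRegExp uses braces for match groups and so has no {3} repetition count. As a rough check of what that octet pattern accepts, here is a small sketch. Note that it uses the standard library's std::regex rather than CAtlRegExp, and the helper name IsOctet is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of ATL): returns true when s matches the
// octet sub-pattern taken from the expression above.
// Assumption: std::regex (ECMAScript syntax) stands in for CAtlRegExp here.
bool IsOctet(const std::string& s)
{
    static const std::regex octet(
        "(2((5[0-5])|([0-4][0-9])))"   // 200-255
        "|(1[0-9][0-9])"               // 100-199
        "|([1-9]?[0-9])");             // 0-99, no leading zero
    // regex_match only succeeds if the whole string matches the pattern.
    return std::regex_match(s, octet);
}
```

With this helper, IsOctet("0") and IsOctet("255") are true, while IsOctet("256") and IsOctet("01") are false, so the rewritten pattern is slightly stricter about leading zeros than the [01]?[0-9][0-9]? form in the original expression.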

This version leaves out the quoted-string form of the local part and the non-IPv4 address literal that the original expression accepts. It also has no Unicode support in domain names, which I was able to solve by following the C# example in the MSDN article "How to: Verify that Strings Are in Valid Email Format" and using IdnToAscii.

In this approach, the user name and domain name are extracted from the email address. The domain name is converted to ASCII using IdnToAscii, the two are put back together, and the result is run through the regular expression.

Please be aware that error handling was omitted for readability. Real code needs to guard against buffer overruns and handle other errors. As written, an email address that does not fit in the 255-character buffers will cause this example to crash.
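The missing length check could be done before anything is copied into a fixed-size buffer. The following is only a sketch in portable C++ with no ATL or Win32 calls. SplitEmailAddress and kMaxEmailLen are hypothetical names, and splitting at the last '@' is an assumption about how the greedy ^{.+}@{.+}$ pattern in the code below behaves.

```cpp
#include <cstddef>
#include <string>

// Hypothetical limit mirroring the 255-element buffers used in this answer
// (254 characters plus the terminating null).
const std::size_t kMaxEmailLen = 254;

// Hypothetical helper: splits "local@domain" at the last '@'.
// Returns false, leaving the outputs untouched, when the input is null,
// too long for the fixed-size buffers, or has an empty local part or domain.
bool SplitEmailAddress(const wchar_t* address,
                       std::wstring& localPart,
                       std::wstring& domain)
{
    if (address == nullptr)
        return false;

    const std::wstring input(address);
    if (input.size() > kMaxEmailLen)
        return false;                      // would not fit in a wchar_t[255] buffer

    const std::wstring::size_type at = input.rfind(L'@');
    if (at == std::wstring::npos || at == 0 || at + 1 == input.size())
        return false;                      // no '@', or nothing on one side of it

    localPart = input.substr(0, at);
    domain    = input.substr(at + 1);
    return true;
}
```

The Punycode form returned by IdnToAscii can be longer than the Unicode domain it came from, so the rejoined string would still need its own length check before being concatenated.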

Code:

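#include <windows.h>                    // IdnToAscii
#include <atlrx.h>                      // CAtlRegExp, CAtlREMatchContext
#pragma comment(lib, "Normaliz.lib")    // import library for IdnToAscii

bool WINAPI LocalLooksLikeEmailAddress(LPCWSTR lpszEmailAddress) 
{
    bool bRetVal = true ;
    const int ccbEmailAddressMaxLen = 255 ;
    // Despite the name, this wide-character buffer holds the address after the domain is converted to ASCII.
    wchar_t achANSIEmailAddress[ccbEmailAddressMaxLen] = { L'\0' } ;
    ATL::CAtlRegExp<> regexp ;
    ATL::CAtlREMatchContext<> regexpMatch ;
    // First pass: capture the local part and the domain as two match groups.
    // The second argument (FALSE) makes the expression case-insensitive.
    ATL::REParseError status  = regexp.Parse(L"^{.+}@{.+}$", FALSE) ;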
    if (status == REPARSE_ERROR_OK) {
        if (regexp.Match(lpszEmailAddress, &regexpMatch) && regexpMatch.m_uNumGroups == 2) {
            const CAtlREMatchContext<>::RECHAR* szStart = 0 ;
            const CAtlREMatchContext<>::RECHAR* szEnd   = 0 ;
            regexpMatch.GetMatch(0, &szStart, &szEnd) ;
            ::wcsncpy_s(achANSIEmailAddress, szStart, (size_t)(szEnd - szStart)) ;
            regexpMatch.GetMatch(1, &szStart, &szEnd) ;
            wchar_t achDomainName[ccbEmailAddressMaxLen] = { L'\0' } ;
            ::wcsncpy_s(achDomainName, szStart, (size_t)(szEnd - szStart)) ;

            if (bRetVal) {
                wchar_t achPunycode[ccbEmailAddressMaxLen] = { L'\0' } ;
                if (IdnToAscii(0, achDomainName, -1, achPunycode, ccbEmailAddressMaxLen) == 0)
                    bRetVal = false ;
                else {
                    ::wcscat_s(achANSIEmailAddress, L"@") ;
                    ::wcscat_s(achANSIEmailAddress, achPunycode) ;
                }
            }
        }
    } 

    if (bRetVal) {
        // Second pass: run the rejoined address through the full expression.
        status = regexp.Parse(
            L"^{([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+(\\.([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+))*)@((([a-z0-9]([a-z0-9\\-]*[a-z0-9])?\\.)+[a-z0-9]([a-z0-9\\-]*[a-z0-9])?)|(\\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\]))}$"
            , FALSE) ;
        if (status == REPARSE_ERROR_OK) {
            bRetVal = regexp.Match(achANSIEmailAddress, &regexpMatch) != 0;
        } 
    }

    return bRetVal ;
}

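One thing worth mentioning is that this approach did not agree with the results in the C# MSDN article for two of the email addresses. For js*@proseware.com, the original regular expression listed on http://emailregex.com allows * in the local part, which suggests that the MSDN article got it wrong, unless the specification has recently been changed. The quoted address "j\"s\""@proseware.com is rejected here because my CAtlRegExp version has no branch for quoted local parts. I decided to go with my adaptation of the regular expression mentioned on http://emailregex.com.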

Here are my unit tests, using the same email addresses as the MSDN article:

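#include <Windows.h>
#include <crtdbg.h>   // _ASSERTE
#include <stdlib.h>   // abort
#if _DEBUG
#define TESTEXPR(expr) _ASSERTE(expr)
#else
#define TESTEXPR(expr) do { if (!(expr)) abort() ; } while (0)
#endif

// Defined in the listing above.
bool WINAPI LocalLooksLikeEmailAddress(LPCWSTR lpszEmailAddress) ;

int main()
{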
    LPCWSTR validEmailAddresses[] = {   L"david.jones@proseware.com", 
                                        L"d.j@server1.proseware.com",
                                        L"jones@ms1.proseware.com", 
                                        L"j@proseware.com9", 
                                        L"js#internal@proseware.com",
                                        L"j_9@[129.126.118.1]", 
                                        L"js*@proseware.com",            // <== according to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is invalid
                                                                         // but according to http://emailregex.com/ that claims to support the RFC 5322 Official standard it's not. 
                                                                         // I'm going with valid
                                        L"js@proseware.com9", 
                                        L"j.s@server1.proseware.com",
                                        L"js@contoso.中国", 
                                        NULL } ;

    LPCWSTR invalidEmailAddresses[] = { L"j.@server1.proseware.com",
                                        L"\"j\\\"s\\\"\"@proseware.com", // <== according to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is valid
                                                                         // but the CAtlRegExp expression used here has no quoted-string branch, so it is rejected. 
                                                                         // I'm going with Invalid
                                        L"j..s@proseware.com",
                                        L"js@proseware..com",
                                        NULL } ;

    for (LPCWSTR* emailAddress = validEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
    {
        TESTEXPR(LocalLooksLikeEmailAddress(*emailAddress)) ;
    }
    for (LPCWSTR* emailAddress = invalidEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
    {
        TESTEXPR(!LocalLooksLikeEmailAddress(*emailAddress)) ;
    }
}
tdemay