Background Information
We use SonarQube to obtain quality metrics for our codebase. SonarQube has flagged over a dozen bugs in our Node.js codebase under rule S6324, all related to an email validation regular expression promoted by emailregex.com, a top-ranking website on Google. The website claims the regex is an RFC 5322 Official Standard. However, SonarQube flags the control characters in the regex for removal because they are non-printable characters. Here is the regex:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
And here is the full list of control characters SonarQube complains about:
‘.\x0e…\x0e…\x0c…\x0c…\x0b…\x0c…\x1f…\x01…\x1f…\x01…\x01…\x09…\x08…\x0b…\x0b…\x0e…\x0b…\x08…\x0c…\x0e…\x09…\x01.’
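To make the flagged ranges concrete, here is a minimal sketch that isolates only the quoted-string branch of the local part from the regex above and checks which code points below 0x20 it accepts unescaped. The `^...$` anchors, the constant name `quotedLocalPart`, and the helper `acceptedControlChars` are my own additions for illustration and are not part of the emailregex.com pattern.

```javascript
// Quoted-string branch of the local part, copied from the regex above.
// Assumption: ^ and $ anchors are added here so the whole string is tested.
const quotedLocalPart =
  /^"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"$/;

// Hypothetical helper: collect every code point below 0x20 that the pattern
// accepts as a single unescaped character between the double quotes.
function acceptedControlChars() {
  const accepted = [];
  for (let code = 0; code < 0x20; code++) {
    const candidate = '"' + String.fromCharCode(code) + '"';
    if (quotedLocalPart.test(candidate)) {
      accepted.push(code);
    }
  }
  return accepted;
}

// Print the accepted code points in the same \xNN notation SonarQube reports.
const formatted = acceptedControlChars()
  .map((code) => "\\x" + code.toString(16).padStart(2, "0"))
  .join(" ");
console.log(formatted);

// A quoted local part with an embedded SOH (0x01) character is accepted.
console.log(quotedLocalPart.test('"a\x01b"'));

// An unescaped TAB (0x09) is rejected, but a backslash-escaped TAB is accepted
// because the second character class runs from \x01 through \x09.
console.log(quotedLocalPart.test('"a\tb"'));
console.log(quotedLocalPart.test('"a\\\tb"'));
```

In this sketch the unescaped branch rejects NUL, TAB, LF, and CR and accepts every other code point below 0x20, so the pattern really does match addresses that contain non-printable characters inside a quoted local part.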
Regular-Expressions.info's Email page does address a variation of the above regular expression as follows:
The reason you shouldn’t use this regex is that it is overly broad. Your application may not be able to handle all email addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can cause trouble if your application needs to display addresses...
However, I can't find any information that explains why some sites include these non-printable control characters, or what is meant by "domain-specific routing addresses". I have looked at several Stack Overflow regex questions and the Stack Overflow Regex Wiki, but control characters don't seem to be addressed there.
The Question
Can someone please explain the purpose of these control characters in the regular expression, and possibly supply some examples of when this regular expression is useful?
(Note: Please avoid debates/discussion about what the best/worst regular expression is for validating emails. There doesn't seem to be agreement on that issue, which has been discussed and debated in many places on Stack Overflow and the broader Internet. This question is focused on understanding the purpose of control characters in the regular expression).
Update
I also reached out to the SonarQube community, and no one seems to have any answers.
Update
Still looking for authoritative answers which explain why the email regular expression above is specifically checking for non-printable control characters in email addresses.
RFC 5322, Section 5 contains the following, but it concerns the message body, not the address:
5. Security Considerations
Care needs to be taken when displaying messages on a terminal or terminal emulator. Powerful terminals may act on escape sequences and other combinations of US-ASCII control characters with a variety of consequences. They can remap the keyboard or permit other modifications to the terminal that could lead to denial of service or even damaged data. They can trigger (sometimes programmable)