
Background Information

We use SonarQube to obtain quality metrics for our codebase. SonarQube has flagged over a dozen bugs in our Node.js codebase under rule S6324, all related to an email validation regular expression advocated by emailregex.com, a top-ranking website on Google. The website claims the regex is an RFC 5322 Official Standard. However, SonarQube flags the control characters in the regex for removal because they're non-printable characters. Here is the regex:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

And here is the full list of control characters SonarQube complains about: ‘.\x0e…\x0e…\x0c…\x0c…\x0b…\x0c…\x1f…\x01…\x1f…\x01…\x01…\x09…\x08…\x0b…\x0b…\x0e…\x0b…\x08…\x0c…\x0e…\x09…\x01.’

Regular-Expressions.info's Email page does address a variation of the above regular expression as follows:

The reason you shouldn’t use this regex is that it is overly broad. Your application may not be able to handle all email addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can cause trouble if your application needs to display addresses...

However, I can't seem to find any information that explains why some sites are adding these non-printable control characters or what they mean by "domain-specific routing addresses". I have looked at some Stack Overflow regex questions and the Stack Overflow Regex Wiki. Control characters don't seem to be addressed.

The Question

Can someone please explain the purpose of these control characters in the regular expression and possibly supply some examples of when this regular expression is useful?

(Note: Please avoid debates/discussion about what the best/worst regular expression is for validating emails. There doesn't seem to be agreement on that issue, which has been discussed and debated in many places on Stack Overflow and the broader Internet. This question is focused on understanding the purpose of control characters in the regular expression).

Update

I also reached out to the SonarQube community, and no one seems to have any answers.

Update

Still looking for authoritative answers which explain why the email regular expression above is specifically checking for non-printable control characters in email addresses.

There is this in RFC 5322, Section 5, but it's about the message body, not the address:

  5. Security Considerations

Care needs to be taken when displaying messages on a terminal or terminal emulator. Powerful terminals may act on escape sequences and other combinations of US-ASCII control characters with a variety of consequences. They can remap the keyboard or permit other modifications to the terminal that could lead to denial of service or even damaged data. They can trigger (sometimes programmable)

halfer
jamesmortensen
  • Your bounty says "_why the non-printable control characters are **needed** in this email validation regex._", and your question says "_and possibly supply some examples of when this regular expression is **useful**_". This makes me wonder if you really mean to ask "_In what _context_ would _a person_ need/want a regex that complies to the RFC specs (instead of a simpler subset)?_" – starball Sep 22 '22 at 06:08
  • Can you clarify what you mean by "authoritative"? Whose authority do you want? The RFC authors'? The authors of emailregex.com's? You already have the RFC docs. As for emailregex.com, you are much more likely to get an answer by contacting them directly. I highly doubt they will stumble upon this question, or that they have publicized a past answer to such a question (if they had, you should have found it in your research already). Likely, they will say: "our regex is made to match any email address as defined by the RFC specs, so such inclusion of these characters is necessary." – starball Sep 24 '22 at 21:55
  • The last part of your question, about Section 5, perfectly answers it. This regex is general purpose, so it assumes input can arrive by means other than a keyboard; a tty, for example, can send control characters along with the string. It prevents charset bugs when, e.g., the address is stored in a database, and ensures that only a valid email address is passed along, nothing more. So if you control your input and will never deal with special control characters, move on and use a lightweight version of that regex. – Aloiso Junior Sep 27 '22 at 00:11

1 Answer

The Purpose

Can someone please explain the purpose of these control-characters in the regular expression [...]?

The purpose of those non-printable control characters would be to create a regex that conforms closely to the RFCs defining email address format.

Just in case anyone is wondering: yes, the control characters in this email regex really do conform to the RFC specs. I think validating this is outside the scope of this question so I won't quote the spec in detail, but here are the relevant sections of RFC 5322: 3.2.3 (atoms), 3.2.4 (quoted strings), 3.4 (address specification), 3.4.1 (addr-spec specification), 4.1 (Misc Obsolete Tokens). In summary, the local part of the address may be a quoted string and the domain part may be a domain literal, and the obsolete syntax in 4.1 allows both to contain certain non-printable control characters.
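As a rough illustration, the quoted-string branch of the local part can be lifted out of the regex in the question and tested on its own. This is only a sketch: the `^`/`$` anchors and the name `quotedLocalPart` are additions for the example and are not part of the original regex.

```javascript
// Quoted-string branch of the local part, copied from the regex in the question.
// The ^ and $ anchors are added here so the whole input must match.
const quotedLocalPart =
  /^"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"$/;

// BEL (\x07) falls inside \x01-\x08, so it is accepted unescaped.
console.log(quotedLocalPart.test('"a\x07b"')); // true

// A bare space (\x20) is not in the first class, so it must be backslash-escaped.
console.log(quotedLocalPart.test('"a b"'));    // false
console.log(quotedLocalPart.test('"a\\ b"'));  // true

// Line feed (\x0a) is excluded from both classes.
console.log(quotedLocalPart.test('"a\nb"'));   // false
```

The ranges `\x01-\x08`, `\x0b`, `\x0c`, and `\x0e-\x1f` line up with the obsolete "no whitespace control" characters that section 4.1 permits inside quoted strings.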

Quoting from SonarQube rule S6324 (emphasis added):

Entries in the ASCII table below code 32 are known as control characters or non-printing characters. As they are not common in JavaScript strings, using these invisible characters in regular expressions is most likely a mistake.

Following a spec is not a mistake. When a lint rule that is usually helpful hits a case in people's code where it is not helpful, people usually just use the lint tool's case-by-case ignore mechanism. I think this addresses the second clause of your bounty, which states:

What is a better alternative that will avoid breaking our site while also passing SonarQube's quality gate?

That is, use one of the provided mechanisms to make SonarQube ignore those rule violations. You could also choose to opt out of checking that rule entirely, but that's probably overkill.

For SonarQube, use NOSONAR comments to disable warnings on a case-by-case basis.
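A minimal sketch of what that could look like in a Node.js module, assuming the regex from the question is kept in a single constant (the name `EMAIL_RE` is made up for this example). The `// NOSONAR` comment goes on the same line as the flagged code and suppresses all issues reported on that line, so it is worth noting the reason next to it.

```javascript
// Regex copied verbatim from the question. The control-character escapes are
// intentional (RFC 5322 obsolete syntax), so the line is marked NOSONAR.
const EMAIL_RE = /(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/; // NOSONAR: S6324, control characters are deliberate

console.log(EMAIL_RE.test("user@example.com"));       // true
console.log(EMAIL_RE.test('"a\x07b"@example.com'));   // true
console.log(EMAIL_RE.test("not an email"));           // false
```

Keeping the regex in one place means only one line needs the suppression, rather than every call site that SonarQube flagged.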

Examples of Usefulness

This comes down to context.

If your end goal is purely to validate whether any given email address is a valid email address as defined by the RFCs, then a regex that closely follows the RFC specs is very useful.

That's not everyone's end goal. Quoting from Wikipedia:

Despite the wide range of special characters which are technically valid, organisations, mail services, mail servers and mail clients in practice often do not accept all of them. For example, Windows Live Hotmail only allows creation of email addresses using alphanumerics, dot (.), underscore (_) and hyphen (-). Common advice is to avoid using some special characters to avoid the risk of rejected emails.

There's nothing there that explains why most applications do not fully adhere to the spec, but you could speculate, or you could go and ask their maintainers. For example, considerations such as simplicity could, in someone's context, be declared or seen as more important than full RFC compliance.

If your goal were to check whether a given email address is a valid Hotmail email address, and to reject addresses that are allowed by the RFCs but not by the subset that Hotmail uses, then full RFC compliance would not be necessary (or useful).
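To make the contrast concrete, here is a hedged sketch of a subset check built only from the Wikipedia description quoted above (alphanumerics, dot, underscore, hyphen). It is not Hotmail's actual validation logic; the name `hotmailLike`, the `@hotmail.com` domain restriction, and the case-insensitive flag are assumptions made for the example.

```javascript
// Illustrative subset pattern: local part limited to letters, digits, ".", "_" and "-".
// This is an assumption based on the quoted description, not Hotmail's real rules.
const hotmailLike = /^[a-z0-9._-]+@hotmail\.com$/i;

console.log(hotmailLike.test("jane_doe-1@hotmail.com")); // true

// Valid per RFC 5322 (quoted local part with a control character),
// but rejected by the narrower subset check.
console.log(hotmailLike.test('"a\x07b"@hotmail.com'));   // false
```

A pattern like this contains no control characters at all, so rule S6324 would have nothing to flag.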

starball