Can it cause harm to validate email addresses with a regex?

Question

I've heard that it is a bad thing to validate email addresses with a regex, and that it actually can cause harm. Why is that?

I thought it never could be a bad thing to validate data. Maybe unnecessary, but never a bad thing provided that you perform the validation correctly. Why is this right or wrong? If it can cause harm, please give an example.

Why is it a bad thing to only "validate" a CC number on purchase? — user2864740, Jan 02 '18 at 04:01
This question is being [discussed on meta](https://meta.stackoverflow.com/questions/417847). — cigien, May 04 '22 at 16:00

bly · Answer 1 · 2022-04-14T08:49:21.743

39

In general, yes - using regular expressions to validate email addresses is harmful. This is because of bad (incorrect) assumptions by the author of the regular expression.

As klutt indicated, an email address has two parts, the local-part and the domain. It's worth noting some things about these parts that aren't immediately obvious:

The local-part can contain escaped characters and even additional @ characters.
The local-part can be case sensitive; it is up to the mail server at that specific domain to decide how it treats case.
The domain part can contain zero or more labels separated by a period (.), though in practice there are no MX records for the root (zero labels) or for the TLDs (one label) themselves.

So, there are some checks that you can do without rejecting valid email addresses that correspond with the above:

Address contains at least one @
The local-part (everything to the left of the rightmost @) is non-empty
The domain part (everything to the right of the rightmost @) contains at least one period (again, this isn't strictly true, but pragmatic)
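
The three checks above can be sketched in a few lines of Python. This is only an illustration, not a definitive implementation; the function name `passes_basic_checks` is made up for this example.

```python
def passes_basic_checks(address: str) -> bool:
    """Apply only the three permissive checks listed above."""
    # Check 1: the address must contain at least one "@".
    if "@" not in address:
        return False
    # Split on the rightmost "@", because the local-part may itself contain "@".
    local_part, _, domain = address.rpartition("@")
    # Check 2: non-empty local-part. Check 3: at least one period in the domain.
    return bool(local_part) and "." in domain
```

Anything that passes still has to be confirmed by actually sending mail to it; the function only filters out input that clearly is not an address.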

That's it. As others have pointed out, it's best practice to test deliverability to that address. This will establish two important things:

Whether the email address currently exists; and
That the user has access to the email address (is the legitimate user or owner)

If you build email activation processes into your business process, you don't need to worry about complicated regular expressions that have issues.

Some further reading for reference:

RFC 5321: Simple Mail Transfer Protocol

OWASP: Input Validation Cheat Sheet

edited Apr 14 '22 at 08:49

answered Jan 09 '18 at 14:31

bly

You might also want to consider how you treat email address validity with mixed case (remembering that the `local-part` might be case sensitive). I'd recommend converting the `domain` part to lowercase and giving some thought to the customer experience of logging in and resetting passwords when the user's mail server isn't case sensitive (they might enter the email in one capitalisation but try to use it a different way later). Disclaimer: I wrote a significant portion of the email address validation content on the OWASP article. – bly Jan 09 '18 at 14:32
As to the harm that can be caused - if you have a false negative (reject an email address that is valid), you're turning away legitimate users that might otherwise pay for your goods. – bly Jan 09 '18 at 14:40
that is not harm; it's unfortunate, but that user will be able to create a new email and likely have it transparently forward quite easily – MrMesees Apr 24 '20 at 06:02
@MrMesees from an economic perspective, discouraging or turning away would-be users from your system is categorically harm; I don't see how it could be anything but. I use secrets for the email strings on most sites, e.g. `rogue+somesecretthing@abc.xyz`, and the number of times I've decided to simply not register on a site that disallows `+` is not a single-digit number – Rogue Nov 15 '20 at 16:55
@Rogue that argument could be used for accepting alternate date and time formats and anything else that would explode complexity. Perhaps users don't like uploading images in a format you like. You need to put limits somewhere. Users can be fairly well in control of their emails. In the case of +, which is not banned or discussed here, I use it, so it would suck if someone blocked it, but it's uncommon and a trade-off for them between being able to exact-match and aggregate emails across services, and not. I say let them make their choices. We'll make ours. – MrMesees Nov 19 '20 at 23:52
@MrMesees It's really not adding complexity, you should just check for `@` and a valid trailing domain. It's all of the added complexity that _breaks_ simple cases like `rogue+secret@email.com` – Rogue Nov 20 '20 at 00:31
@MrMesees I agree. It's a matter of balance. On one hand, sure, a user can have an escaped double-quoted email address, but it's so uncommon you may not ever encounter it, and when you do, let's be honest, that user was being esoteric on purpose and your site isn't the first time they've had an issue. On the other hand, typos like autocorrect on mobile adding spaces after dots occur regularly and also present obstacles to clickthrough or conversion. Not addressing the latter in favor of the former seems unwise. You can allow + without abandoning regex. – Kyle Alm Dec 15 '21 at 12:41
What is a *"gTLD"*? Do you mean *[TLD](https://en.wikipedia.org/wiki/Top-level_domain)*? – Peter Mortensen Feb 10 '22 at 21:05
@PeterMortensen I did mean TLD! [gTLD](https://en.wikipedia.org/wiki/Generic_top-level_domain) is a thing but not what I had intended here - I have edited for clarity, thanks! – bly Apr 14 '22 at 08:51

klutt · Answer 2 · 2023-08-22T10:09:36.793

TL;DR

Don't use regexes for validating emails, unless you have a good reason to use them. Use a verification mail instead. In most cases, a regex that simply checks that the string contains an @ is enough.

Short version

In most cases, the question "How do I validate an email address with a regex" is an XY problem, because it's most likely not the solution to your actual problem. The real problem is probably "How do I make sure that the email address the user is entering can be used to communicate with the user?" or, as zsalya mentioned in the comments, "What sanitization should you apply to a user-entered email address before storing it in your database?"

Constructing regexes for validating emails can be a good and fun exercise, but in general, you should really avoid it in production code. The proper way of verifying an email address is in most cases to send a verification mail. Trying to verify if a mail address matches the specification is very tricky, and even if you get it right, it's still often useless information unless you know that it's a mail address that you can send mails to and that someone reads.

Think about it. How often do you have a use for storing a mail address that's wrong?

If you just want to make sure that a user does not mix up input fields, check that the mail address contains an @ character. That's enough. Well, it would not catch those who insist on using that character in user names or passwords, but that's their headache. ;)

Long version

In a majority of the cases where you would want to use this, just knowing that the email address is valid does not mean a thing. What you really want to know is if it is the right email address.

The reason may differ. You may want to send newsletters, use it for regular communication, password recovery or something else. But whatever it is, it's important that it is the right address. It's not important to know if the address fulfills a complicated standard. The only important thing is to know if it can be used for the purpose you have of storing the address.

The proper way to verify this is by sending a mail with a verification link.

If you have verified the email address with a verification link, there's often no point in checking if it is a correct email address, since you know it works. It could however be used for basically checking that the user is entering the email address in the correct field. My advice in this case is to be extremely forgiving. I'd say it's enough to just check that there is an @ in the field. It's a simple check and ALL email addresses include an @. If you want to make it more complicated than that, I would suggest just warning the user that there might be something wrong with the address, but not forbidding it. A pretty simple regex that would have extremely few false negatives (if any) is

.+@.+\..+

This means a non-empty string before the @, followed by a non-empty domain, a dot and a non-empty top-level domain. But actually, I'd just stick with @.+ which means that the right part is non-empty, and I don't know of any DNS server that would accept an empty server name.
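
As a rough sketch of the "warn, don't forbid" idea, the two patterns above could be used like this in Python. The function name and the three status strings are invented for the example; how each status is presented to the user is up to the application.

```python
import re

# The two patterns quoted in the text above.
LOOSE_PATTERN = re.compile(r".+@.+\..+")
MINIMAL_PATTERN = re.compile(r"@.+")


def classify_email_field(value: str) -> str:
    """Return 'missing_at', 'warn' or 'ok' for a user-entered value."""
    # No "@" followed by at least one character: probably the wrong input field.
    if MINIMAL_PATTERN.search(value) is None:
        return "missing_at"
    # Unusual shape (for example no dot after the "@"): warn, but still accept.
    if LOOSE_PATTERN.fullmatch(value) is None:
        return "warn"
    return "ok"
```

Only `missing_at` would be a candidate for blocking the form; `warn` is meant to trigger a "please double-check" hint while still accepting the input.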

Properly checking an email against the standard is actually really tricky

But a bigger concern is that a regex for accurately verifying an email address is actually a very complex matter. If you try to create a regex on your own, you will almost certainly make mistakes. One thing worth mentioning here is that the standard, RFC 5322, allows comments within parentheses. To make things worse, nested comments are allowed. A standard regex cannot match nested patterns. You will need an extended regex for this. While extended regexes are not unusual, it does say something about the complexity. And even if you get it right, will you update the regex when a new standard comes?
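
To illustrate why nesting is the sticking point: matching arbitrarily nested parentheses requires counting, which a classical regular expression cannot do. The sketch below is a deliberate simplification (it ignores quoted strings and every other part of the grammar), and the function name is hypothetical. It only shows the kind of depth counter that is needed.

```python
def comments_are_balanced(text: str) -> bool:
    """Check that parenthesised comments, possibly nested, open and close properly."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A backslash escapes the next character, so skip both.
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis without a matching opening one.
                return False
        i += 1
    # Every opened comment must have been closed.
    return depth == 0
```

A complete RFC 5322 parser needs far more than this, which is part of the argument for not writing one by hand.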

The mail server might support non-standard addresses

And one more thing: even if you get it 100% right, that still may not be enough. An email address has the local part on the left side of the @ and the domain part on the right. Everything in the local part is meant to be handled by the server. Sure, RFC 5322 is pretty detailed about what a valid local part looks like, but what if a particular email server accepts addresses that are not valid according to RFC 5322? Are you really sure you don't want to allow a particular email address that does work just because it does not follow the standard? Do you want to lose customers for your business just because they have chosen an obscure email provider? Or because you have made a mistake in the regex? (Hint: It's very easy to make mistakes with language-specific characters.)

Here, I might add that I have experienced not being able to register on various web sites because of my email address. And I don't even have a strange address. It's simply <name>@protonmail.com, but some sites claim that it's not a valid address. I have a hard time believing it's because of <name>, since it only contains 12 lowercase letters from a-z.

If you really want to check if an address is correct in production code, then use the MailAddress class or something equivalent. But first take a minute to ponder if this really is what you want. Ask yourself if the address has any value if it is not the correct address. If the answer is no, then you don't. Use verification links instead.

That being said, it can be a good thing to validate input. The important thing is to know why you are doing it. Validating the email with a regex or (preferably) something like the MailAddress class could give some protection against malicious input, such as SQL injections and such. But if this is the only method you have to protect you against malicious input, then you're doing something else very wrong.
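
As an illustration of what the primary protection against SQL injection usually looks like, here is a minimal sketch using Python's standard `sqlite3` module with a parameterized query. The table name, column name and helper function are assumptions made up for this example.

```python
import sqlite3


def store_email(conn: sqlite3.Connection, address: str) -> None:
    # The "?" placeholder makes the driver pass the value as data, never as SQL text.
    conn.execute("INSERT INTO users (email) VALUES (?)", (address,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT NOT NULL)")

# A hostile-looking value is stored verbatim instead of being executed.
store_email(conn, "x'); DROP TABLE users; --@example.com")
rows = conn.execute("SELECT email FROM users").fetchall()
```

With the placeholder in place, the safety of the query no longer depends on how strict the email validation is.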

Since there are existing email "conforms to RFC address" libraries available, those can be used to mitigate those aspects of this answer.. however, it's really the same deal as when dealing with a Phone # / URI / Address / CC / etc. - being "syntactically valid", or even "is *an* [..] somewhere" doesn't indicate that it's actually *usable* for the purpose (eg. of contacting the user, sending a bill, making a payment) which is *often* - but not always! - the desired information. — user2864740, Jan 02 '18 at 04:03
I agree that a verification link is needed, but this doesn't mean that validation is pointless. — Tim Biegeleisen, Jan 02 '18 at 04:05
@Xatenev My point is that if you know that the email address works, then it's almost never any point in checking if it fits some standard. Do you disagree? — klutt, Jan 02 '18 at 04:05
@klutt I disagree in that a validator can capture *some* forms of user entry error (or even malicious input, depending on how this is defined). This is why physical addresses are validated enough though they generally can't be "proven" until mail is sent/accepted. (Physical addresses are harder to validate than email addresses so.. then again, this is what external *libraries and services* are for.) — user2864740, Jan 02 '18 at 04:07
@klutt You are usually trying to design your software for the [luser](https://en.wikipedia.org/wiki/Luser), who might end up typing in the wrong email address accidentally. He will be able to register successfully but won't receive an email and might never come back to your page. Of course you can't catch all forms of things people might enter, but it makes sense to help them **as much as you can**. — Xatenev, Jan 02 '18 at 04:07
If you validate your form by using a "type=email" field with REQUIRED set to true, and then you get a form input that has an invalid or missing email field, you know that it's likely that the POST has arrived from a bot and not a live user, and can act accordingly. — Marc Wilson, Mar 11 '21 at 16:57
@MarcWilson Nice one. Valid point. But then they have a valid primary purpose with the validation. You can always go against "the rules" if you have a good enough reason. This answer is most about why you should not just do it "just because you should". About why it's not necessarily harmless. — klutt, Mar 11 '21 at 21:26
Let's turn the question round: "What sanitization should you apply to a user-entered email address before storing it in your database?" — zsalya, Mar 23 '23 at 13:45

score 9 · Answer 3 · edited May 10 '22 at 14:07

In addition to the other answers, I would like to point out that regex engines that use backtracking are susceptible to ReDoS (regular expression denial of service) attacks. The attack is based on the fact that many non-trivial regular expressions have inputs that can take an extraordinary amount of CPU cycles to produce a non-match.

Crafting such an input might cause trouble for the availability of the site even with a small botnet.

Mitigations of the issue:

it is often possible to rewrite the regular expression to avoid catastrophic backtracking; or
using a regex engine without support for backtracking. While most engines support it, engines without such support do exist; a notable example is the RE2 regex engine used by Go/Golang.
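
A small sketch of the first mitigation, using Python's backtracking `re` module. The two patterns are textbook toy examples rather than email patterns, and the helper name is invented for this illustration. Both patterns accept exactly the same strings, but only the first one backtracks catastrophically on a near-miss.

```python
import re
import time

VULNERABLE = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of "a"
REWRITTEN = re.compile(r"^a+$")      # same language, only one way to match


def seconds_to_reject(pattern: re.Pattern, n: int) -> float:
    """Time how long the pattern takes to reject n 'a' characters followed by '!'."""
    subject = "a" * n + "!"
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # both patterns must reject this input
    return elapsed
```

Calling `seconds_to_reject(VULNERABLE, n)` with slowly increasing `n` (start around 18 and add a couple at a time) should show the time growing very quickly with each extra character, while `REWRITTEN` stays fast. Exact timings depend on the machine and the Python version.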

For more information: "Regular Expressions Denial of the Service (ReDoS) Attacks"

edited May 10 '22 at 14:07

Peter Mortensen

answered Jan 12 '18 at 20:48

Mindaugas Bernatavičius

This is misleading: Only RegEx engines that use backtracking can be susceptible to a DoS attack; and even then, the regular expression has to be written in such a way that it would be susceptible. It's generally possible to rewrite regular expressions to not be vulnerable, even on backtracking engines. – awwright Mar 31 '22 at 00:01
@awwright I updated the answer, however I don't think it was misleading. There are no incorrect generalizations like "all regex engines" or "all non-trivial expressions" used in the original answer. A thinking reader will probably read the article not this short post if he likes to know more (that's why I linked the article - to avoid writing all the details and nuance in the answer). – Mindaugas Bernatavičius Mar 31 '22 at 06:26

Stephen C · Answer 4 · 2021-05-21T02:49:27.657

It is not inherently bad to validate email addresses.

It is not even inherently bad to validate email addresses using regexes ... though there are arguably better ways to validate them¹.

The real issue is that validating email addresses based on syntax alone is ineffective:

It does not tell you if the address corresponds to a valid, working mailbox.
It does not tell you if it is an address for the correct user (or agent).

Since users often accidentally (or deliberately²) enter syntactically valid but incorrect email addresses, you need to do something else if you need to know if the address is the correct address for the person involved. For example, you could send some kind of "activation" or "confirmation" email to the address provided.

So, assuming that you are going to implement the second stage of checking, the first stage of syntax checking the email address is relatively unimportant, and not even strictly necessary.

¹ Creating a regex that correctly deals with all of the edge-cases in the email syntax is non-trivial. However, it may be acceptable to disallow some of the more abstruse edge-cases, provided it doesn't unduly inconvenience a significant number of users.

² Regex validation is next to useless for filtering out deliberately fake email addresses.

Levi Morrison · Answer 5 · 2018-01-09T13:53:19.733

If your regular expression is ill-formed then you might deny valid email addresses. This goes for any "email validation" rule.

I know of an email address that is regularly rejected by forms even though it doesn't contain any email oddities; it's merely long. It really annoys the person it belongs to, because the part before the @ is their legal name, an obvious choice for an email address.

That is part of the potential harm of email validation done incorrectly: annoying users by denying valid email addresses from entering the system.

score 1 · Answer 6 · edited Feb 08 '22 at 01:30

I've heard that it is a bad thing to validate email addresses with a regex, and that it actually can cause harm. Why is that?

This is correct. The regex solution is attractive, because an email address is a structured string, and regex is used to find structure in strings.

It is also the wrong solution, because when you ask the user for an email address, it is usually so you can contact them.

The validation is incorrect because:

the address may be valid, but not an address the user has access to. I could fill in the address billgates@microsoft.com in any form, and it will probably be accepted as a valid email address (disclaimer: I am not Bill Gates :) ).
the syntax for email addresses is very tricky to get right (see the examples here). By defining your own regex for email validation, you will end up rejecting valid addresses and accepting invalid ones.

I thought it never could be a bad thing to validate data.

It's not bad to validate data. In this case though, you will provide a feature in your application, that is defective by design:

Your application looks to your developers as if it is validating the input, but the validation is unnecessary, probably incomplete, and at the end of the validation, you don't know if you have an address that will allow you to contact the user.

Maybe unnecessary, but never a bad thing provided that you perform the validation correctly.

It is not unnecessary; it is necessary. It's just that regex is the wrong tool for it.

At the end of the day, the best way to check that the address is valid for the user is a unique token exchange for that address:

send an email to the address, containing a unique random token (store token with user data)
ask user in the email to "click the link/button", effectively sending you the token back.
verify the token.
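
The three steps above could look roughly like this in Python. Everything here is a sketch under assumptions: the in-memory dictionary stands in for the user database, the `example.com` URL is a placeholder, and actually sending the email is left out.

```python
import hmac
import secrets
from typing import Dict
from urllib.parse import urlencode

# Hypothetical in-memory store (address -> token); a real system would use its user database.
pending_tokens: Dict[str, str] = {}


def start_verification(address: str) -> str:
    """Create a random token for the address and return the link to email to the user."""
    token = secrets.token_urlsafe(32)
    pending_tokens[address] = token
    query = urlencode({"email": address, "token": token})
    # Placeholder URL; the application would send this link to `address`.
    return f"https://example.com/verify?{query}"


def confirm_verification(address: str, token: str) -> bool:
    """Check the token sent back by the user; each token works only once."""
    expected = pending_tokens.get(address)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return False
    del pending_tokens[address]
    return True
```

A production version would typically also give tokens an expiry time and limit how often a verification mail can be requested.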

score -2 · Answer 7 · edited Feb 10 '22 at 21:17

Regex is not harmful.

Use a good email regex to filter the impatient fake user.

If you are selling to that individual, you might want to contact them for further validation, though sellers don't care about email too much and just validating the credit card is good enough for them.

Otherwise, the only other place where validation is necessary is when someone wants to access and interact with your forum, and for some reason you want to get remuneration by selling their email to mass advertisers, even though you say you won't do that.

A general email regex in the HTML5 specification is this -

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

http://www.w3.org/TR/html5/forms.html#valid-e-mail-address

 ^
 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+
 @
 [a-zA-Z0-9]
 (?:
      [a-zA-Z0-9-]{0,61}
      [a-zA-Z0-9]
 )?
 (?:
      \.
      [a-zA-Z0-9]
      (?:
           [a-zA-Z0-9-]{0,61}
           [a-zA-Z0-9]
      )?
 )*
 $
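
For anyone who wants to see what this pattern accepts and rejects, here is a small Python sketch. The pattern string is copied from above; the constant and function names are made up for the example.

```python
import re

# The HTML5 pattern quoted above, split across lines for readability.
HTML5_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def matches_html5_pattern(address: str) -> bool:
    """Return True if the whole string matches the HTML5 email pattern."""
    return HTML5_EMAIL.fullmatch(address) is not None
```

The pattern accepts plus-addressing such as `rogue+secret@email.com` and even a dotless domain such as `user@localhost`, but it rejects quoted local parts like `"john doe"@example.com`, which RFC 5321 permits.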

What is the value in "filtering the impatient fake user"? If such user doesn't pass the regex check, they will just replace it with something like dfsjalkdsfahj@example.com and call it a day. It serves no value IMO, the other answers with more votes speak about it in detail. — Nicofisi, Apr 14 '22 at 11:29

score -2 · Answer 8 · edited Feb 10 '22 at 21:21

A regular expression is probably the best way to validate an email address, so long as you use the correct one. Once you've checked the address with a regular expression, there are only a few additional requirements that must be checked (that the address is not too long, and that it is valid UTF-8).

This is because the ABNF grammar that defines the form of email addresses is "regular", which means it can be described exactly as a regular expression; without backtracking, recursion, or any non-regular features.

It's only a matter of understanding the specification; but once you do that, it turns out the regular expression for email address is actually very simple: How can I validate an email address using a regular expression?