Why are people using regexp for email and other complex validation?

Question

There are a number of email regexp questions popping up here, and I'm honestly baffled why people are using these insanely obtuse matching expressions rather than a very simple parser. Such a parser would split the email up into the name and domain tokens, then validate the name against the characters allowed there (there's no further check that can be done on this portion) and the domain against the characters allowed for domains. I suppose you could add checking for all the world's TLDs, and then another level of second-level domains for countries with such (e.g., co.uk).

The real problem is that the TLDs and SLDs keep changing (contrary to popular belief), so if you plan on doing all this high-level checking, you have to keep updating the regexp whenever the root name servers send down a change.

Why not have a module that simply validates domains, which pulls from a database, or flat file, and optionally checks DNS for matching records?
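A rough sketch of the kind of module described above, in Python. Everything named here is an assumption made for illustration: the function `validate_address`, the `KNOWN_TLDS` set (a stand-in for the database or flat file), and the character sets, which cover only the common unquoted forms and not the full RFC grammar. The optional DNS lookup is left out.

```python
import string

# Hypothetical stand-in for a TLD list loaded from a database or flat file.
KNOWN_TLDS = {"com", "org", "net", "uk"}

# Characters allowed in an unquoted local part (RFC 5322 "atext" plus the dot).
NAME_CHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~.")
# Characters allowed in a host name label, plus the dot separator.
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-.")


def validate_address(address, known_tlds=KNOWN_TLDS):
    """Split on the last '@' and check each token against its character set."""
    name, sep, domain = address.rpartition("@")
    if not sep or not name or not domain:
        return False
    if not set(name) <= NAME_CHARS:
        return False
    if not set(domain) <= DOMAIN_CHARS:
        return False
    labels = domain.lower().split(".")
    # Every label must be non-empty, and the last one must be a known TLD.
    return all(labels) and labels[-1] in known_tlds
```

With this shape, a change to the TLD list is a data update rather than a change to the matching logic.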

I'm being serious here, why is everyone so keen on inventing the perfect regexp for this? It doesn't seem to be a suitable solution to the problem...

Convince me that it's not only possible to do in regexp (and satisfy everyone) but that it's a better solution than a custom parser/validator.

-Adam

score 25 · Accepted Answer · answered Oct 17 '08 at 11:58

They do it because they see "I want to test whether this text matches the spec" and immediately think "I know, I'll use a regex!" without fully understanding the complexity of the spec or the limitations of regexes. Regexes are a wonderful, powerful tool for handling a wide variety of text-matching tasks, but they are not the perfect tool for every such task and it seems that many people who use them lose sight of that fact.

Dave Sherohman


No idea why this got downvoted; that seems a perfect explanation of why people (incorrectly) fall into the regex trap. +1 from me... – Marc Gravell Oct 17 '08 at 12:04
So, to paraphrase: "When a programmer has a problem, she thinks, 'I'll use a regex!' Now the programmer has two problems." – Adam Davis Oct 17 '08 at 12:05
While I'm familiar with the MJD quote, I don't like its implication that regexes always make your problem worse. They're a good choice in many cases, just not this one. – Dave Sherohman Oct 17 '08 at 12:30
(Correction: JWZ quote, not MJD...) – Dave Sherohman Oct 17 '08 at 12:35
I don't necessarily disagree with your point that regexes are not the way to go, but you don't really say anything to help your case. I think your answer can be paraphrased into: "Oh, regexes are good for some stuff, but not for this." That leaves an ignorant dev just as ignorant. – Daniel Apr 16 '09 at 13:11
Perhaps, but the question here is "why do people do this?", not "why shouldn't you do it?". The question itself goes into why you shouldn't use regexes for this, so there's no need to address that point when answering it. – Dave Sherohman Apr 17 '09 at 01:04

score 8 · Answer 2 · answered Oct 17 '08 at 11:55

Regexes that catch most (but not all) common errors are relatively easy to set up and deploy. It takes longer to write a custom parser.

Brian Knoblauch


So the basic argument here is that simple/trivial email validation is more easily completed in regexp. That I buy. But why, then, are so many people trying to perform complete validation with regexp when it's obviously harder and takes longer to develop, maintain, and understand? – Adam Davis Oct 17 '08 at 12:01
The primary problem with regex validation tends to not be that errors slip through, but rather that they're usually overly-restrictive and insist that some classes of valid, working, RFC-compliant addresses are "invalid" and refuse to accept them. – Dave Sherohman Oct 17 '08 at 12:39
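To illustrate the comment above, here is a Python sketch with a made-up pattern of the kind that often gets copied around. `NAIVE_PATTERN` and `naive_is_valid` are hypothetical and not taken from any particular library. The pattern refuses addresses whose syntax is perfectly acceptable:

```python
import re

# Hypothetical over-restrictive pattern: no '+' allowed in the local part,
# and the TLD is limited to 2-4 letters.
NAIVE_PATTERN = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def naive_is_valid(address):
    """Return True only if the whole string matches the naive pattern."""
    return NAIVE_PATTERN.fullmatch(address) is not None


# All three are acceptable address syntax, yet the last two are refused.
for candidate in ("user@example.com", "user+tag@example.com", "user@example.museum"):
    print(candidate, naive_is_valid(candidate))
```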

score 8 · Answer 3 · edited Feb 08 '17 at 14:07

The temptation of using RegExp, once you've mastered the basics, is very big. In fact, RegExp seems so powerful that people naturally want to start using it everywhere. I really suspect that there's a lot of psychology involved here, as demonstrated by Randall's XKCD comic (and yes, it is useful).

I once gave an introductory presentation on RegExp, and the most important slide warned against its overuse. It was the only slide that used bold font. I believe this should be done more often.

Everybody stand back!

score 4 · Answer 4 · answered Oct 17 '08 at 11:59

Using regular expressions for this is not a good idea, as has been demonstrated at length in those other posts.

I suppose people keep doing it because they don't know any better or don't care.

Will a parser be any better? Maybe, maybe not.

I maintain that sending a verification e-mail is the best way to validate it. If you want to check anything from JavaScript, then check that it has an '@' sign in there and something before and after it. If you go any stricter than that, you risk running up against some syntax you didn't know about, and your validator will become overly restrictive.
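A minimal sketch of that loose check, written in Python here rather than JavaScript. The name `looks_like_email` is made up for this example:

```python
def looks_like_email(text):
    """Loose check: an '@' with at least one character on each side."""
    name, sep, domain = text.rpartition("@")
    return bool(sep and name and domain)
```

Anything that passes still needs the verification e-mail to show that the address actually works.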

Also, be careful with that TLD validation scheme of yours, you might find that you are assuming too much about what is allowed in a TLD.

score 3 · Answer 5 · answered Oct 17 '08 at 12:43

People do it because in most languages it is way easier to write regexp than to write and use a parser in your code (or so it seems, at least).

If you decide to eschew regexes, you will have to either write parsers by hand or resort to external tools (like yacc) for lexer/parser generation. This is way more complex than a single-line regex match.

One needs a library that makes it easy to write parsers directly in language X (where 'X' is C, C++, C#, Java) to be able to build custom parsers with the same ease as regular expression matchers.

Such libraries originated in the functional land (Haskell and ML), but nowadays "parser combinator libraries" exist for Java, C++, C#, Scala and other mainstream languages.
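To give a feel for the style, here is a hand-rolled Python sketch of the combinator idea. It is not the API of any existing parser combinator library; every name in it (`char_in`, `many1`, `seq`, `sep_by1`, `matches`) is made up for illustration, and the grammar is a simplified assumption that covers only unquoted local parts and plain host names.

```python
import string

# A parser here is a function (text, pos) -> new position, or None on failure.


def char_in(allowed):
    """Parser that consumes one character from the allowed set."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in allowed:
            return pos + 1
        return None
    return parse


def many1(parser):
    """Apply parser one or more times."""
    def parse(text, pos):
        pos = parser(text, pos)
        if pos is None:
            return None
        while True:
            nxt = parser(text, pos)
            if nxt is None:
                return pos
            pos = nxt
    return parse


def seq(*parsers):
    """Apply parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def sep_by1(item, separator):
    """One or more items with a separator between each pair."""
    def parse(text, pos):
        pos = item(text, pos)
        if pos is None:
            return None
        while True:
            nxt = seq(separator, item)(text, pos)
            if nxt is None:
                return pos
            pos = nxt
    return parse


ATEXT = string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~"
LABEL = string.ascii_letters + string.digits + "-"

dot = char_in(".")
local_part = sep_by1(many1(char_in(ATEXT)), dot)  # atoms separated by dots
domain = sep_by1(many1(char_in(LABEL)), dot)      # labels separated by dots
addr_spec = seq(local_part, char_in("@"), domain)


def matches(parser, text):
    """True if parser consumes the whole text."""
    return parser(text, 0) == len(text)
```

Once the small building blocks exist, the grammar itself is three lines, which is the ease of use the answer is pointing at.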

Michael Carman · Answer 6 · 2008-10-17T13:22:00.937

People use regexes for email addresses, HTML, XML, etc. because:

1. It looks like they should work and they often do work for the obvious cases.
2. They "know" regular expressions. When all you have is a hammer all your problems look like nails.
3. Writing a parser is harder (or seems harder) than writing a regular expression. In particular, writing a parser is harder than writing a regex that handles the obvious cases in #1.
4. They don't understand the full complexity of the task.
5. They don't understand the limitations of regular expressions.
6. They start with a regex that handles the obvious cases and then try to extend it to handle others. They get locked into one approach.
7. They aren't aware that there's (probably) a library available to do the work for them.

score 3 · Answer 7 · answered Oct 17 '08 at 13:51

"...and then validates those against the valid characters allowed for name (there's no further check that can be done on this portion)"

This is not true. For example, "ben..doom@gmail.com" contains only valid characters in the name section, but is not valid.
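The difference can be seen in a short Python sketch. Both helpers are made up for illustration, and the character set is an assumption that covers only the unquoted form of the local part:

```python
import string

ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def chars_ok(name):
    """Character-only check: every character is atext or a dot."""
    return bool(name) and set(name) <= ATEXT | {"."}


def structure_ok(name):
    """Also require non-empty atoms between dots (no leading, trailing or double dots)."""
    atoms = name.split(".")
    return all(atom and set(atom) <= ATEXT for atom in atoms)
```

Here `chars_ok("ben..doom")` passes while `structure_ok("ben..doom")` fails, because the double dot leaves an empty atom in the middle.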

In languages that do not have libraries for email validation, I generally use regex because:

I know regex, and find it easy to use
I have many friends who know regex, and whom I can collaborate with
It's fast for me to code, and me-time is more expensive than processor-time for most applications
For the majority of email addresses, it works.

I'm sure many built-in libraries do use your approach, and if you want to cover all the possibilities, it does get ridiculous. However, so does your parser. The formal spec for email addresses is absurdly complex. So, we use a regex that gets close enough.

score 3 · Answer 8 · answered Feb 26 '09 at 16:53

I don't believe correct email validation can be done with a single regular expression (now there's a challenge!). One of the issues is that comments can be nested to an arbitrary depth in both the local part and the domain.

If you want to validate an address against RFCs 5322 and 5321 (the current standards) then you'll need a procedural function to do so.
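As a small illustration of why a procedural approach helps with nesting, here is a Python sketch that checks whether parenthesised comments are balanced by tracking depth with a counter. The function name is made up, and the sketch deliberately ignores quoted strings and the rest of the RFC grammar:

```python
def comments_balanced(text):
    """Return True if every '(' has a matching ')' at any nesting depth."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and depth > 0:
            i += 2  # skip a backslash-escaped character inside a comment
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no open comment
                return False
        i += 1
    return depth == 0
```

The counter can grow without bound, which is the part that a pattern with a fixed amount of state cannot express.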

Fortunately, this is a commodity problem. Everybody wants the same result: RFC compliance. There's no need for anybody to write this code ever again once it's been solved by an open source function.

Check out some of the alternatives here: http://www.dominicsayers.com/isemail/

If you know of another function that I can add to the head-to-head, let me know.

score 2 · Answer 9 · answered Apr 21 '09 at 23:17

We're just looking for a fast way to see if the email address is valid so that we can warn the user they have made a mistake or prevent people from entering junk easily. Going off to the mail server and fingering it is slow and unreliable. The only real way to be sure is to get a confirmation email, but the problem is only to give a fast response to the user before the confirmation process takes place. That's why it's not so important to be strictly compliant. Anyway, it's a challenge and it's fun.

score 1 · Answer 10 · answered Oct 17 '08 at 12:07

People write regular expressions because most developers like to solve a simple problem in the most "cool" and "efficient" way (which means that it should be as unreadable as possible).

In Java, there are libraries to check if a String represents an email address without you having to know anything about regular expressions. These libraries should be available for other languages as well.

Like Jamie Zawinski said in 1997: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

score 1 · Answer 11 · answered Apr 21 '09 at 23:24

One factor: the set of people who understand how to write a regular expression is very much larger than the set of people who understand the formal constraints on regular languages. Same goes for non-regular "regular expressions".

simon


score -3 · Answer 12 · answered Oct 17 '08 at 11:51

Regexps are much faster to use, of course, and they only validate what's specified in the RFC. Write a custom parser? What? It takes 10 seconds to use a regexp.

Terminus


Take another look at one of the multi-thousand character regexes needed to come close to actually conforming with RFC (2)822 before calling it "fast" (when accurate) or accurate (when fast). – Dave Sherohman Oct 17 '08 at 11:55
Read before commenting blindly. Particularly think about what "fast" means. – Terminus Oct 17 '08 at 12:02
-1, since (as has been pointed out numerous times), regexes are in fact *not* able to match all addresses as laid down by the spec. – unwind Oct 17 '08 at 12:05
"Faster to use" reads to me that it executes quickly. Given one of your other comments, I now gather that you intended to say that it was quick to write, to which I can only respond that it takes me no longer to add a call to a library's parser than to copy/paste a regex. – Dave Sherohman Oct 17 '08 at 12:35
But the thread is not about using an existing parser -- I think everyone agrees that a custom email validator is better in every way than a regexp. The point is using an existing regexp vs. *IMPLEMENTING* a custom parser. At least this is how I see the question. – Terminus Oct 17 '08 at 12:41
Writing a regex for email parsing will take a little longer than 10 seconds ;-) – johnstok Oct 17 '08 at 13:29
