
I am trying to validate emails (UTF8) using the following regular expression

Regex.IsMatch(emailAddress, @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$", RegexOptions.CultureInvariant);

It returns false for "äpfel@domain.com".

Any suggestions on how to improve it?

Leri
  • See http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address – StevieB Jan 17 '14 at 09:03
  • Email validation with regex is more complicated than it seems: http://stackoverflow.com/a/201378/1283847 – Leri Jan 17 '14 at 09:04
  • @StevieB Thanks, I have already gone through the link, but I thought that setting the option RegexOptions.CultureInvariant and specifying \w in the regex would validate all UTF8 words. – user3205838 Jan 17 '14 at 09:10

2 Answers


The simple answer is that you don't want to do this: regular expressions are a horrible way of validating email addresses.

The answer to your specific question is that, if you are willing to block valid addresses and permit invalid ones, you want to use [\p{L}\p{M}\p{N}] rather than \w to match Unicode word characters in the username part of the address.
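For illustration, here is a minimal sketch of that substitution applied to the pattern from the question. The class name UnicodeLocalPart and the decision to add _ to the class are assumptions made for this example (\w includes the underscore, the three Unicode categories do not); it is not a recommended validator.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeLocalPart
{
    // The question's pattern with only the local part changed:
    // [\w-\.] becomes [\p{L}\p{M}\p{N}_\.\-] (letters, marks, numbers, '_', '.', '-').
    const string Pattern =
        @"^([\p{L}\p{M}\p{N}_\.\-]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$";

    public static bool IsMatch(string address)
    {
        return Regex.IsMatch(address, Pattern);
    }

    static void Main()
    {
        // 'ä' is a letter (\p{L}), so the local part is accepted.
        Console.WriteLine(IsMatch("äpfel@domain.com")); // True
    }
}
```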

Mark
  • Why do you think regex are a horrible way of validating email addresses? What better solution do you suggest? – Thomas Levesque Jan 17 '14 at 09:13
  • The full legal syntax of email addresses makes for an incredibly complicated regex (the simplest RFC-822-compliant one I've seen is a page and a half long). If you want to check to see if an address is valid, just send an email to it and see if it bounces. – Mark Jan 17 '14 at 09:15
  • A state machine would suit this task better. Bit more verbose of course. – StevieB Jan 17 '14 at 09:19
  • @Mark, sure, but it's not scalable... It's hard to build the correct regex, but once you have it, it works pretty well. – Thomas Levesque Jan 17 '14 at 09:21
  • @StevieB, a regex just generates a state machine behind the scene ;) – Thomas Levesque Jan 17 '14 at 09:21
  • @Mark I am importing thousands of email addresses, so it is not going to be feasible to send emails to verify the accounts and then output the list of incorrect email addresses to the user. – user3205838 Jan 17 '14 at 09:21
  • @Mark I tried the replacement but it still returns false for the email "äpfel@domain.com" – user3205838 Jan 17 '14 at 09:27
  • @Thomas, that may be. But a state machine gives you more fine-grained control than a regex: for example, the state machine could suggest changes to make an invalid email address valid (or even auto-correct, e.g. me@domain,com to me@domain.com). Try doing that with a regex :-P – StevieB Jan 17 '14 at 10:01
  • @user3205838 no, that returns true on the example address, just like your original. You're fixing the wrong bug. – Jon Hanna Jan 17 '14 at 10:22
  1. UTF-8 has nothing to do with this; you're validating a string, not a particular encoding thereof.

  2. Your Regex actually returns true for "äpfel@domain.com" (with or without the CultureInvariant option). Try Console.Write(Regex.IsMatch("äpfel@domain.com", @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$", RegexOptions.CultureInvariant)); on its own, and you get true.

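  3. You will fail on IDNs whose last label is not ASCII, because the pattern requires the address to end in [a-zA-Z]{2,4} (info@ουτοπία.δπθ.gr gets through only because \w is Unicode-aware in .NET and gr is ASCII), and if you care about non-ASCII email addresses you may want to include them. (And if you want to exclude prohibited confusables, it gets really complicated.)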
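To make these points checkable, here is a small sketch that runs the question's pattern under different options. The class name RegexDemo and the address info@example.рф are made up for this illustration. The only behavior it relies on is that .NET's \w is Unicode-aware by default and falls back to [a-zA-Z_0-9] under RegexOptions.ECMAScript.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexDemo
{
    // The pattern exactly as posted in the question.
    const string Pattern =
        @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$";

    public static bool Matches(string address, RegexOptions options)
    {
        return Regex.IsMatch(address, Pattern, options);
    }

    static void Main()
    {
        // \w is Unicode-aware by default, with or without CultureInvariant.
        Console.WriteLine(Matches("äpfel@domain.com", RegexOptions.None));             // True
        Console.WriteLine(Matches("äpfel@domain.com", RegexOptions.CultureInvariant)); // True

        // ECMAScript mode restricts \w to ASCII, so 'ä' is rejected.
        Console.WriteLine(Matches("äpfel@domain.com", RegexOptions.ECMAScript));       // False

        // Hypothetical address with a non-ASCII final label: [a-zA-Z]{2,4} cannot match it.
        Console.WriteLine(Matches("info@example.рф", RegexOptions.None));              // False
    }
}
```

If the real code prints False for the first two calls, the input string or the options probably differ from what was posted, for example a pattern compiled elsewhere with RegexOptions.ECMAScript.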

Others have already stated the problems with using regular expressions to validate emails, but they boil down to:

  1. The actual email syntax is more complicated than people think (even before we deal with the non-ASCII extensions). For example, did you know that Abc\@def@example.com is a valid email address? It is; in fact, it's an example of a valid address given in RFC 3696.

  2. If you go to the effort of building a perfect validator (it is possible), it'll be a waste of effort. Chances are your email software won't handle every valid address (e.g. Abc\@def@example.com above won't work with a lot of software), and then lots of valid email addresses won't actually work in practice.

But anyway, I get true running your code; the bug is elsewhere.

Jon Hanna