Possible Duplicate:
Using a regular expression to validate an email address

This is homework. I've been working on it for a while and have done a lot of reading, and I feel I've become fairly familiar with regex for a beginner.

I am trying to find a regular expression for validating a list of email addresses. Two addresses are giving me problems: I can't get both of them to be classified correctly at the same time. I've gone through a dozen different expressions that work for all the other addresses on the list, but none of them handles those two together.

First, the addresses.

me@example..com  - invalid
someone.nothere@1.0.0.127  - valid

For the part of my expression that validates the suffix, I originally started with

@.+\\.[[a-z]0-9]+

I also had a second pattern for catching some more invalid addresses, and I checked each email against both patterns: one checked for validity, the other for invalidity. But my professor said he wanted it all in one expression, so I tried

@[[\\w]+\\.[\\w]+]+

or

@[\\w]+\\.[\\w]+

I've tried writing it many, many different ways, but I'm pretty sure I was just using different syntax to express these same two expressions.

I know what I want it to do: I want it to match a character class of "character+"."character+"+

The plus sign means "at least one". It works for the invalid address when I only allow the character class to occur once (and obviously the IP address doesn't get matched), but when I allow the character class to repeat, it matches the second period even though it isn't preceded by a character. I don't understand why.

I've even tried grouping everything with (), putting {1} after the escaped ., changing the \w to a-z, and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.

  • You can't nest character classes like that. Have a bit more of a read about character classes to understand what one means, and then have a look at subpatterns. – cmbuckley Oct 17 '12 at 22:57

4 Answers

You need a negative look-ahead:

@\w+\.(?!\.)

See http://www.regular-expressions.info/lookaround.html

Test in Perl:

Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127

Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1

Perl> $_ = 'me@example..com'
me@example..com

Perl> print "OK\n" if /\@\w+\.(?!\.)/

Perl> 
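
Since the question turns out to be about Java, here is a minimal sketch of the same test in Java. The class and method names (`LookaheadDemo`, `hasValidDot`) are made up for illustration. It uses `Matcher.find()`, which searches anywhere in the input the way the bare Perl match does; `matches()` requires the whole input to match, which is one possible reason a partial pattern like this could reject every address.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class LookaheadDemo {
    // Same pattern as the Perl test; backslashes are doubled for a Java string literal.
    static final Pattern DOT_NOT_DOUBLED = Pattern.compile("@\\w+\\.(?!\\.)");

    // find() succeeds if the pattern occurs anywhere in the input.
    static boolean hasValidDot(String address) {
        return DOT_NOT_DOUBLED.matcher(address).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValidDot("someone.nothere@1.0.0.127")); // true
        System.out.println(hasValidDot("me@example..com"));           // false
    }
}
```

Note that this only checks the first dot after the `@`; it is not a full address validator.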
Gilles Quénot
  • I want to see him explain that one to his professor – Mike Park Oct 17 '12 at 23:01
  • Sorry, it's Java; I don't know how much that changes things. That compiles but makes everything invalid – user1754700 Oct 17 '12 at 23:10
  • As we can see in this post http://stackoverflow.com/questions/11817249/regex-lookaround-construct-in-java-advise-on-optimization-needed it's possible to use look-around in Java... – Gilles Quénot Oct 17 '12 at 23:15
  • Ok, when I make it \\w+\\.(?!\\.)\\w+ it makes the invalid address invalid but it still won't accept the ip address. If I make the whole thing a character class of at least one then both addresses will be valid again. – user1754700 Oct 17 '12 at 23:20
  • @(\\w+\\.(?!\\.))+\\w+ works, thanks a bunch, I can see that negative lookahead will be very useful – user1754700 Oct 17 '12 at 23:22
  • If so, on Stack Overflow we "upvote" answers that helped, and we "accept" the answer that was what we expected. – Gilles Quénot Oct 18 '12 at 00:03
  • @user1754700 - I'd recommend only using look-aheads and look-behinds when you really need to. They make the regex harder to understand, and (potentially) they limit the regex engine's ability to optimize. – Stephen C Oct 18 '12 at 00:05
@([\\w]+\\.)+[\\w]+

Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.

Dallin
  • That solution also works, thanks Dallin. I had tried something just like that, but I mistakenly tried to nest a character class instead of just using grouping... like @[[\\w]+\\.]+[\\w]+... – user1754700 Oct 17 '12 at 23:35

I think you want this:

@[\\w]+(\\.[\\w]+)+

This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dallin's answer.)

The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
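
To make the difference concrete, here is a small Java sketch comparing the grouped pattern above with the nested-bracket form mentioned in the question's comments. The class and helper names (`GroupVsClassDemo`, `suffixMatches`) are hypothetical. In Java, brackets nested inside a character class form a union, so `[[\\w]+\\.]` is a single class containing word characters, a literal `+`, and a literal `.`; repeating it happily accepts consecutive dots.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class GroupVsClassDemo {
    // Parentheses group a sequence: one "word", then one or more ".word" sequences.
    static final Pattern GROUPED = Pattern.compile("@[\\w]+(\\.[\\w]+)+");

    // Nested brackets are still ONE character class (word chars, '+', '.'), repeated.
    static final Pattern NESTED_CLASS = Pattern.compile("@[[\\w]+\\.]+[\\w]+");

    // Checks only the part of the address from '@' onward, requiring a full match of it.
    static boolean suffixMatches(Pattern pattern, String address) {
        int at = address.indexOf('@');
        if (at < 0) {
            return false; // no '@' at all
        }
        return pattern.matcher(address.substring(at)).matches();
    }

    public static void main(String[] args) {
        System.out.println(suffixMatches(GROUPED, "someone.nothere@1.0.0.127")); // true
        System.out.println(suffixMatches(GROUPED, "me@example..com"));           // false
        System.out.println(suffixMatches(NESTED_CLASS, "me@example..com"));      // true (the bug)
    }
}
```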

Stephen C

The syntax that the official standard, RFC 2822, defines for valid email addresses can be expressed with this regular expression:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
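
As a rough sketch of how the practical pattern could be used from Java: the class and method names (`PracticalEmailCheck`, `isValid`) are hypothetical, and the `CASE_INSENSITIVE` flag is an assumption added here because the pattern as written only lists lowercase letters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class PracticalEmailCheck {
    // The practical pattern from above, with "\." escaped as "\\." for a Java string literal.
    static final Pattern EMAIL = Pattern.compile(
        "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
        + "@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        Pattern.CASE_INSENSITIVE); // assumption: uppercase letters should be accepted too

    // matches() requires the entire input to fit the pattern.
    static boolean isValid(String address) {
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("someone.nothere@1.0.0.127")); // true
        System.out.println(isValid("me@example..com"));           // false
    }
}
```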
Ωmega