0

I never really thought I'd need help with regular expressions, but here goes:

I am looking for a single regular expression for parsing e-mail addresses:

  • someone@example.com → {"name": "", "email": "someone@example.com"}
  • Some One <someone@example.com> → {"name": "Some One", "email": "someone@example.com"}

The regex has to produce two groups: name and email.

This is what I have so far:

regex = r"^((?P<name>[^(\s+\<)]*)\s+\<)?(?P<email>[^@]+?@[^>]+)>?$"

I am absolutely sure that I need to escape something within the first nested block, because this is an actual result:

{'email': 'Some One <someone@example.com', 'name': None}
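
For reference, a minimal sketch that reproduces this result with Python's `re` module. The variable names `address` and `match` are only illustrative:

```python
import re

# The pattern exactly as posted in the question.
regex = r"^((?P<name>[^(\s+\<)]*)\s+\<)?(?P<email>[^@]+?@[^>]+)>?$"

address = "Some One <someone@example.com>"
match = re.match(regex, address)

# The optional name group does not participate in the match, so "name" is None
# and the email group swallows everything up to the closing bracket.
print(match.groupdict())
```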

EDIT: forgot to put * in regex (doesn't answer the question)
EDIT2: solved. Thanks everyone for your help.
EDIT3: renamed SO: quote → escape

Tomas Tomecek
  • possible duplicate of [Using a regular expression to validate an email address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address) – Conspicuous Compiler Mar 25 '15 at 18:49
  • @ConspicuousCompiler it's not a dupe; I'm fine with `[^@]+@.+` – Tomas Tomecek Mar 25 '15 at 18:52
  • I'm afraid the way I see it, one way or the other, this question is not likely to help future SO users. Either this question is asking for help with a typo in a regular expression, which is unlikely to help future searchers, or this question is looking for an authoritative "how to regexp match an email address" which is answered in the dupe link. – Conspicuous Compiler Mar 25 '15 at 18:57
  • I think it is helpful for future readers, because it is *really* about quoting within nested brackets and parentheses; [see my answer](http://stackoverflow.com/a/29264299/909579) – Tomas Tomecek Mar 25 '15 at 19:05
  • It's virtually impossible to match an email address with a regex. See http://stackoverflow.com/q/201323/372239 – Toto Mar 25 '15 at 19:39

2 Answers

2

"Regular" expressions are called that because they specify what is known as "regular languages". This category of languages is characterized by context-free rules; for example, the word "bow" means one thing only, regardless of which words it is surrounded by (let's say it's a keyword for "what dogs say"). This is distinct from context-dependent languages, where "bow" in "I bow before you" is different from "shoot with a bow" which is different from "bow wow".

Modern regular expressions somewhat transcend this definition, but nevertheless, the From: header syntax defined in RFC822 is too complex to be parsed by even a souped-up regular expression engine. You really, properly need a context-dependent grammar (and indeed, RFC5322 includes one) to completely parse every possible variation allowed by the specification. To connect to the previous example, what \" means (i.e. how it should be parsed) depends on whether you are inside double quotes or not, and whether or not you are looking at the "real name", the email terminus, or a comment (in parentheses).

Now, you might want to back off, and say that only some of the possible variations are actually in common, widespread use; that's true, and there are regular expressions which handle almost all of them.

Try your regular expression on the test suite at http://code.iamcal.com/php/rfc822/tests/ and decide for yourself which of those test failures actually matter to you. Maybe you can come up with a good spec for what you "really mean". But your question, as it stands, has to be answered with a resounding "it cannot be done".
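
If the goal is only to split a display name from an address, one option that avoids a hand-written pattern is the standard library's `email.utils.parseaddr`. The following is a minimal sketch, not something proposed in this thread; `split_address` is a hypothetical helper name, and the output shape simply mirrors the one requested in the question:

```python
from email.utils import parseaddr


def split_address(text):
    """Return the display name and address as a dict (hypothetical helper)."""
    # parseaddr returns a (realname, email_address) tuple; the name is an
    # empty string when the input carries no display name.
    name, email = parseaddr(text)
    return {"name": name, "email": email}


print(split_address("someone@example.com"))
print(split_address("Some One <someone@example.com>"))
```

Whether its handling of the more exotic header forms matches what you "really mean" is still something to check against your own inputs.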

tripleee
  • I've clearly provided inputs and expected outputs. I am **not** looking for a regex for parsing e-mail addresses. I already have those. I just need to extract name and email. – Tomas Tomecek Mar 26 '15 at 06:39
  • You are not making sense. Determining which part of the input string is the name and which part is the email address is what parsing means. If you have already solved that, why are you asking? – tripleee Mar 26 '15 at 07:44
  • Because I can't accept my own answer yet (my original post already contains a link to the solution and the word 'solved') – Tomas Tomecek Mar 26 '15 at 08:44
  • That still doesn't make sense. Your question says "I am looking for a single regular expression for parsing e-mail addresses" and now you are saying you are not? – tripleee Mar 26 '15 at 08:52
  • For your entertainment, https://gist.github.com/tripleee/93d9c4c152e99fa4d976 contains a simple validator script which uses the test cases from the URL above. – tripleee Mar 26 '15 at 09:25
0

There was an answer here for a couple of seconds (and then the OP deleted it) which had the answer inside:

You need to double escape.

regex = r"^((?P<name>[^(\\s+\<)]*)\s+<)?(?P<email>[^@]+?@[^>]+)>?$"
                        ↑   ↑

EDIT: quote → escape

EDIT2:
this regex works much better:

r'^\s*(?P<name>[^\s<>](?:.*?[^\s<>])?)??\s*<?(?P<email>[^<>@\s]+@[^<>@\s]+)>?$'

Thanks @tripleee
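
A short sketch of how this pattern could be used to get the two fields asked for in the question. `ADDRESS_RE` and `parse_address` are illustrative names, and the test addresses other than the two from the question are made up for the example:

```python
import re

# The pattern from EDIT2, compiled once.
ADDRESS_RE = re.compile(
    r'^\s*(?P<name>[^\s<>](?:.*?[^\s<>])?)??\s*<?(?P<email>[^<>@\s]+@[^<>@\s]+)>?$'
)


def parse_address(text):
    """Return {"name": ..., "email": ...}, or None if the pattern does not match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    # The name group is optional, so it is None when only an address is given;
    # map that to the empty string the question asks for.
    return {"name": match.group("name") or "", "email": match.group("email")}


for text in (
    "someone@example.com",
    "Some One <someone@example.com>",
    "<someone@example.com>",
    "First Middle Last <someone@example.com>",
):
    print(parse_address(text))
```

Inputs in the older `address (Real Name)` form do not match this pattern, so `parse_address` returns `None` for them.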

Tomas Tomecek
  • This specifies a character class which matches one character which is not opening round parenthesis, literal backslash, the character `s`, the character `+`, the character `<` (needlessly escaped with a backslash) or the closing round parenthesis; the class can be matched zero or more times (so it's really not meaningful at all). The fact that this change allowed you to parse your single test case is coincidental. – tripleee Mar 25 '15 at 19:34
  • @tripleee this is very interesting; in python, regex `[^()]*` matches characters until subpattern in parentheses matches (left parenthesis is interpreted) – Tomas Tomecek Mar 26 '15 at 07:05
  • Also FWIW, at higher reputation levels (10k+) we can see the deleted answers. There is no deleted answer with this suggestion. – tripleee Mar 26 '15 at 07:44
  • @tripleee the answer had escaped groups within parentheses (e.g. `\\s`), that's how I figured it out – Tomas Tomecek Mar 26 '15 at 08:42
  • Those were because the deleted answer used regular `'...'` strings rather than raw `r'...'` strings, so the backslashes needed to be escaped just to be taken as literal backslashes in the regex. – tripleee Mar 26 '15 at 08:48
  • Incidentally, editing your question to add "solved" is not recommended practice on StackOverflow. Just accept the answer, once you are able to. – tripleee Mar 26 '15 at 08:55
  • Anyway, this trivially fails on `Lastname First <…>` and on `<…>` with no real name but brokets around the email terminus. – tripleee Mar 26 '15 at 09:12
  • Also fails on `First Middle Last <…>` and `Gregorius <…>`. – tripleee Mar 26 '15 at 09:35
  • If I am able to fathom what you are attempting, I'm guessing `r'^\s*(?P<name>[^\s<>](?:.*?[^\s<>])?)??\s*<?(?P<email>[^<>@\s]+@[^<>@\s]+)>?$'` might be more or less what you tried to accomplish. That will still fail on addresses like `oldfashioned@example.net (Old Fashioned)` but those are not very popular any longer. – tripleee Mar 26 '15 at 09:52
  • @tripleee wow, you are right! My regex indeed blew up on some of the inputs while yours worked just fine; can you post it as an answer so I can accept it? Thanks a lot! – Tomas Tomecek Mar 27 '15 at 13:16
  • I already posted my answer and am not inclined to change it. Feel free to update your own answer with this regex instead, though. – tripleee Mar 27 '15 at 13:18
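
The character-class point raised in the comments can be checked directly. This is a minimal sketch: `parse` is an illustrative helper, and the address paired with `Lastname First` is an assumption, since only the name part is given in the thread:

```python
import re

# The "double escaped" pattern from the top of this answer (raw string).
regex = r"^((?P<name>[^(\\s+\<)]*)\s+<)?(?P<email>[^@]+?@[^>]+)>?$"

# The character class on its own: it excludes "(", "\", "s", "+", "<" and ")".
char_class = r"[^(\\s+\<)]"
print(re.fullmatch(char_class, " ") is not None)  # whitespace is allowed
print(re.fullmatch(char_class, "s") is not None)  # a lowercase "s" is not


def parse(address):
    """Apply the pattern and return its named groups (illustrative helper)."""
    match = re.match(regex, address)
    return match.groupdict() if match else None


# No lowercase "s" in the name, so the single test case happens to work.
print(parse("Some One <someone@example.com>"))
# "Lastname" contains an "s": the name group is skipped and the email group
# takes everything before the closing bracket.
print(parse("Lastname First <someone@example.com>"))
```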