How to improve this email regex?

Question

I am trying to match email addresses in Python using regex with this pattern:

"\w{1,}@\w{1,}.\w{1,}"

However, sometimes there are email addresses that look like firstname.lastname@lol.omg.hahaha.museum, which my pattern will miss.

Is there a way to adjust this regex so it will include an arbitrary number of chained ".word" type patterns?

@user51819 - Vaultah actually makes a valid point; it is _very_ difficult to have a valid regular expression for an email (just because there are many different formats that a valid email address can take). Many applications use a simple check for the at-symbol within the string (and something after the at-symbol) — Chris Forrence, May 21 '15 at 16:44
You can't just check for @ in string because then you'd be matching on non-email addresses like "@randomword" or "I'll meet you @7" or "@someone: hi!" or "gibberish@gibberish more gibberish" — user51819, May 21 '15 at 16:45
Are you trying to validate an entered email address, or search for email addresses in text? — Barmar, May 21 '15 at 16:45
`+` is like `*`, but it matches 1 or more instead of 0 or more. Just like `{1,}` does. — Barmar, May 21 '15 at 16:46
"gibberish@gibberish" is a valid email address (believe it or not!) — Chris Forrence, May 21 '15 at 16:46
Theoretically you can use a top-level domain for email. The owner of `.com` could create addresses like `owner@com`. Practically, no one does this, and there are probably millions of address validators that won't allow it. — Barmar, May 21 '15 at 16:50
More likely, `gibberish` and `nonsense` could be two subdomains or hosts within the same domain, and mail between the two can drop the common suffix. Similar to (but implemented separately from?) DNS search domains. — chepner, May 21 '15 at 16:52
From [html5](http://www.w3.org/TR/html5/forms.html#valid-e-mail-address) `^[a-zA-Z0-9.!#$%&'*+/=?^_\`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$` — , May 21 '15 at 17:13
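
For anyone who wants to try the HTML5 pattern quoted in the last comment, here is a minimal Python sketch. It assumes you are validating a single entered string rather than scraping text. The names `HTML5_EMAIL` and `looks_like_email` are made up for this example, and the Markdown-escaped backtick is written as a plain backtick.

```python
import re

# HTML5 "valid e-mail address" pattern as quoted in the comment above.
# Raw strings keep the backslashes intact; the backtick needs no escaping.
HTML5_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def looks_like_email(candidate):
    """Return True if the whole string matches the HTML5 pattern."""
    # fullmatch is used so that a trailing newline is not silently accepted.
    return HTML5_EMAIL.fullmatch(candidate) is not None


for s in ["firstname.lastname@lol.omg.hahaha.museum",
          "gibberish@gibberish",
          "@randomword",
          "myname@somewhere.com."]:
    print(s, looks_like_email(s))
```

Note that this pattern accepts `gibberish@gibberish` (no dot is required in the domain) and rejects a trailing period, so it is only suitable for checking one candidate string at a time.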

score 0 · Answer 1 · answered May 21 '15 at 16:45

You shouldn't try to match email addresses with regex. You'll have to use a more complicated state machine to check whether the address correctly matches RFC 2822 (since superseded by RFC 5322).

https://pypi.python.org/pypi/validate_email is one such library you can check out.

answered May 21 '15 at 16:45

Santiclause

According to his comment, he's not doing validation, he's scraping. – Barmar May 21 '15 at 16:47

score 0 · Answer 2 · answered May 21 '15 at 16:48

This should work for you:

[a-zA-Z0-9._-]+@([a-zA-Z0-9.-]+\.)+[a-zA-Z0-9.-]{2,4}

answered May 21 '15 at 16:48

Joel Fazio

That will miss addresses that I use. But I'm okay with that. – Ignacio Vazquez-Abrams May 21 '15 at 16:48
Any reason you don't use `\w` instead of `a-zA-Z0-9_`, like his original regexp does? – Barmar May 21 '15 at 16:48
Can you do \w+ instead of [a-zA-Z0-9._-]+? – user51819 May 21 '15 at 16:49
@user51819 `\w` is short for letters, numbers, and underscore. It doesn't include dot or dash. That's why your original regexp doesn't work. – Barmar May 21 '15 at 16:51
I've found that `\w` sometimes includes some weird chars that would not be valid in an email address. I suppose it depends on what language you are using. – Joel Fazio May 21 '15 at 16:52
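
To make the points in these comments concrete, here is a small sketch, assuming Python 3 and the standard `re` module. The variable names are only for illustration. It shows what the question's pattern and this answer's pattern each capture from the example address, and how `\w` behaves on a non-ASCII letter.

```python
import re

original = r"\w{1,}@\w{1,}.\w{1,}"  # pattern from the question
answer2 = r"[a-zA-Z0-9._-]+@([a-zA-Z0-9.-]+\.)+[a-zA-Z0-9.-]{2,4}"

addr = "firstname.lastname@lol.omg.hahaha.museum"

# \w does not match "." so the local part is cut at the dot, and the
# domain stops after one ".word" group.
m1 = re.search(original, addr).group(0)
print(m1)  # lastname@lol.omg

# The final {2,4} caps the last label at four characters, so ".museum"
# is truncated when searching and rejected when matching the whole string.
m2 = re.search(answer2, addr).group(0)
print(m2)  # firstname.lastname@lol.omg.hahaha.muse
whole = re.fullmatch(answer2, addr)
print(whole)  # None

# In Python 3, \w on str patterns is Unicode-aware unless re.ASCII is set.
unicode_word = re.fullmatch(r"\w+", "café") is not None
ascii_word = re.fullmatch(r"\w+", "café", flags=re.ASCII) is not None
print(unicode_word, ascii_word)  # True False
```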

karthik manchala · Accepted Answer · 2015-05-21T17:00:52.297

You can use the following:

[\w.-]+@[\w-][\w.-]+\w

(I replaced `{1,}` with its equivalent, `+`.)
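
A quick way to try this pattern for scraping is sketched below in Python. The sample text and variable names are invented for the example. It also shows that a sentence-final period is left out, because the pattern must end on a `\w` character.

```python
import re

# Pattern from this answer, written as a raw string.
pattern = re.compile(r"[\w.-]+@[\w-][\w.-]+\w")

# Sample text invented for this example.
text = ("my email is myname@somewhere.com. Another one is "
        "firstname.lastname@lol.omg.hahaha.museum")

# With no capture groups, findall returns the full matches.
found = pattern.findall(text)
print(found)
# ['myname@somewhere.com', 'firstname.lastname@lol.omg.hahaha.museum']
```

This is a search heuristic, not validation: it will also accept strings that are not deliverable addresses.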

edited May 21 '15 at 17:00

answered May 21 '15 at 16:50

karthik manchala

@Barmar yes I did, but I don't see the user asking for it? – karthik manchala May 21 '15 at 16:53
Won't this incorrectly pick up periods that are at the ends of sentences? For example "my email is myname@somewhere.com." – user51819 May 21 '15 at 16:54
I am just trying to understand the structure. What does "[\w.-]" by itself mean? An a-zA-Z0-9_ character, or a period, or a hyphen? – user51819 May 21 '15 at 16:56
@user51819 it means match any of the characters in the set `[a-zA-Z0-9_.-]` – karthik manchala May 21 '15 at 16:58
So that last piece forces the last character to be a-zA-Z0-9_, cool! This regex was informative, thanks – user51819 May 21 '15 at 16:59
@user51819 `\w` has the same meaning inside brackets as it does outside brackets. – Barmar May 21 '15 at 16:59
@Barmar Do you mean something like [\w.-]+@[\w-][\w.-]+\w? – user51819 May 21 '15 at 17:00
@karthikmanchala Is there a way to negate something? Like if you want to say "I don't want this to be a \w" – user51819 May 21 '15 at 17:08
@user51819 you can use a negated set `[^...]`, in this case `[^\w]` – karthik manchala May 21 '15 at 17:12
What about "(?<!\w)"? – user51819 May 21 '15 at 17:23
@user51819 that is a negative lookbehind; it matches at positions where the previous character is not a `\w` character. – karthik manchala May 21 '15 at 17:27
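
For reference, the two kinds of negation discussed in these comments can be sketched in Python as follows. The sample strings and variable names are made up for the example. A negated set consumes one character that is not `\w`, while the negative lookbehind only checks the preceding position without consuming anything.

```python
import re

# Negated set: matches one character that is NOT a word character.
# [^\w] is equivalent to the shorthand \W.
negated = re.findall(r"[^\w]", "a-b c")
shorthand = re.findall(r"\W", "a-b c")
print(negated, shorthand)  # ['-', ' '] ['-', ' ']

# Negative lookbehind: "@" only counts when the previous character
# is not a word character (or there is no previous character).
mentions = re.findall(r"(?<!\w)@\w+", "mail user@host or ping @someone")
print(mentions)  # ['@someone']
```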
