-2

I am trying to match email addresses in Python using regex with this pattern:

"\w{1,}@\w{1,}.\w{1,}"

However, some email addresses look like firstname.lastname@lol.omg.hahaha.museum, which my pattern will miss.

Is there a way to adjust this regex so it will include an arbitrary number of chained ".word" type patterns?

vaultah
user51819
  • 4
    Yeah - `'@' in string`. This is the best you can do. – vaultah May 21 '15 at 16:41
  • Use `[\w.]` instead of `\w`. – Barmar May 21 '15 at 16:43
  • 1
    @user51819 - Vaultah actually makes a valid point; it is _very_ difficult to have a valid regular expression for an email (just because there are many different formats that a valid email address can take). Many applications use a simple check for the at-symbol within the string (and something after the at-symbol) – Chris Forrence May 21 '15 at 16:44
  • 1
    BTW, instead of `{1,}` you should use `+`. – Barmar May 21 '15 at 16:44
  • You can't just check for @ in string because then you'd be matching on non-email addresses like "@randomword" or "I'll meet you @7" or "@someone: hi!" or "gibberish@gibberish more gibberish" – user51819 May 21 '15 at 16:45
  • Are you trying to validate an entered email address, or search for email addresses in text? – Barmar May 21 '15 at 16:45
  • @Barmar what is "+"? – user51819 May 21 '15 at 16:46
  • @Barmar Scrape email addresses from text – user51819 May 21 '15 at 16:46
  • `+` is like `*`, but it matches 1 or more instead of 0 or more. Just like `{1,}` does. – Barmar May 21 '15 at 16:46
  • 2
    "gibberish@gibberish" is a valid email address (believe it or not!) – Chris Forrence May 21 '15 at 16:46
  • @ChrisForrence Even without a ".word" suffix?! – user51819 May 21 '15 at 16:48
  • 1
    Theoretically you can use a top-level domain for email. The owner of `.com` could create addresses like `owner@com`. Practically, no one does this, and there are probably millions of address validators that won't allow it. – Barmar May 21 '15 at 16:50
  • 1
    More likely, `gibberish` and `nonsense` could be two subdomains or hosts within the same domain, and mail between the two can drop the common suffix. Similar to (but implemented separately from?) DNS search domains. – chepner May 21 '15 at 16:52
  • I'd use `\s+(.*?@.*?\..*?)\s+`. – chris85 May 21 '15 at 16:59
  • From [html5](http://www.w3.org/TR/html5/forms.html#valid-e-mail-address) `^[a-zA-Z0-9.!#$%&'*+/=?^_\`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$` –  May 21 '15 at 17:13

3 Answers

0

You shouldn't try to match email addresses with regex. To check whether an address correctly matches RFC 2822 (since superseded by RFC 5322), you'll have to use a more complicated state machine.

https://pypi.python.org/pypi/validate_email is one such library you can check out.

Santiclause
0

This should work for you:

[a-zA-Z0-9._-]+@([a-zA-Z0-9.-]+\.)+[a-zA-Z]{2,}
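A minimal sketch of how a pattern of this shape might be used to scrape addresses from text in Python. The `find_emails` name and the sample text are made up for illustration, and the pattern is an assumed variant of the one above: the group is written as non-capturing `(?:...)`, because `re.findall` returns only the group's contents when the pattern has a capturing group, and the last label is `[a-zA-Z]{2,}` so that long TLDs such as `.museum` are matched in full.

```python
import re

# Assumed variant of the pattern above: non-capturing group, and a final
# label of two or more letters so long TLDs are not cut short.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


sample = (
    "Write to firstname.lastname@lol.omg.hahaha.museum or bob@example.com. "
    "I'll meet you @7, and @someone: hi!"
)
print(find_emails(sample))
# ['firstname.lastname@lol.omg.hahaha.museum', 'bob@example.com']
```

Because at least one `label.` group is required after the `@`, strings such as `@7` or `gibberish@gibberish` are skipped. This is a scraping heuristic, not a validator.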
Joel Fazio
0

You can use the following, where `{1,}` is replaced with its equivalent, `+`:

[\w.-]+@[\w-][\w.-]+\w
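As a rough illustration, here is how the pattern above behaves with `re.findall` in Python. The sample strings and the `PATTERN` name are invented for this sketch. Note that the trailing `\w` keeps a sentence-ending period out of the match, but the pattern does not require a dot after the `@`, so something like `gibberish@gibberish` is also picked up.

```python
import re

# The pattern from this answer, written as a raw string.
PATTERN = re.compile(r"[\w.-]+@[\w-][\w.-]+\w")

text = "Mail firstname.lastname@lol.omg.hahaha.museum, or bob@example.com."
matches = PATTERN.findall(text)
print(matches)
# ['firstname.lastname@lol.omg.hahaha.museum', 'bob@example.com']

# No dot is required after the "@", so this also matches.
no_dot = PATTERN.findall("gibberish@gibberish more gibberish")
print(no_dot)
# ['gibberish@gibberish']

# A bare "@word" with nothing in front of the "@" is not matched.
bare_at = PATTERN.findall("I'll meet you @7")
print(bare_at)
# []
```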
karthik manchala