regex for email parsing in python

Question

I've been asked to write a regular expression that can catch multi-domain email addresses and to implement it in Python. I came up with the following regular expression (and code; the emphasis is on the regex, though), which I think is correct:

import re
regex = r'\b[\w|\.|-]+@([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc@def.ghi"
match=re.findall(regex,input_string)
print match

Now when I run this (using a very simple address) it doesn't catch it! Instead it shows an empty list as the output. Can somebody tell me where I went wrong in the regular expression?

Just google this! There's loads of content on it. Even on SO. — Mmm Donuts, Jan 07 '16 at 03:43
Possible duplicate of [Using a regular expression to validate an email address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address) — ShadowRanger, Jan 07 '16 at 03:47
i know that there are tons of copy-and-paste email regular expressions out there, but this problem is hurting my brain; knowing that the regex is right, yet it's not working. — mzajy, Jan 07 '16 at 03:48
a) It doesn't output an empty string, it outputs `['def.']` (which is the only bit you're capturing with `()`). b) the regex is not right, you can't use `|` like that inside a character class - inside `[]` it matches the pipe character literally, it doesn't do either-or, and `\b` doesn't match at the end of a string, and the regex is broken for addresses like `example@example.google` which don't have a 2-4 digit TLD. — TessellatingHeckler, Jan 07 '16 at 03:51
And on top of @TessellatingHeckler's notes, if you have capturing groups, `findall` returns the capture groups, not the full match. Change `([\w]+\.)` to `(?:\w+\.)` to change the parens to non-capturing (also removing the superfluous but harmless brackets; `\w` is a character class by itself). — ShadowRanger, Jan 07 '16 at 03:53
you just need to make `([\w]+\.)+` to non-capturing group `(?:[\w]+\.)+`, because if you specify a capture, findall will extract that part only. — YOU, Jan 27 '16 at 07:29
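The point made in the comments about `findall` and capture groups can be checked with a small sketch. It is written for Python 3 (an assumption; the question's `print match` is Python 2), and the variable names are only illustrative:

```python
import re

input_string = "hey my mail is abc@def.ghi"

# Pattern from the question: one capturing group around each "label." part.
capturing = r'\b[\w|\.|-]+@([\w]+\.)+\w{2,4}\b'
# Same pattern with the group made non-capturing via (?:...).
non_capturing = r'\b[\w|\.|-]+@(?:[\w]+\.)+\w{2,4}\b'

# With a capturing group, findall returns the group's text
# (its last repetition), not the whole match.
captured_only = re.findall(capturing, input_string)
# Without a capturing group, findall returns the whole match.
full_matches = re.findall(non_capturing, input_string)

print(captured_only)  # ['def.']
print(full_matches)   # ['abc@def.ghi']
```

Note that the trailing `\b` does match here even though the address sits at the very end of the string: `\b` matches at the end of a string whenever the last character is a word character.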

score 1 · Answer 1 · answered Jan 07 '16 at 04:02

Here's a simple one to start you off with:

regex = r'\b[\w.-]+?@\w+?\.\w+?\b'
re.findall(regex,input_string)  # ['abc@def.ghi']

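One problem with your original is the | inside the character class ([..]): there it does not mean "or", it matches a literal pipe character. Just write [\w|\.|-] as [\w.-] (inside a class the . needs no escaping, and neither does the - if it is at the end). The unexpected output itself comes from the capturing group: when a pattern contains a group, findall returns the group's contents rather than the whole match, so ([\w]+\.)+ should be the non-capturing (?:\w+\.)+.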
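To see what the | does inside a character class, here is a minimal sketch (Python 3 assumed; the sample string and names are made up for illustration):

```python
import re

# Inside [...] the | is an ordinary character, so this class also accepts pipes.
with_pipe = re.compile(r'[\w|\.|-]+')
# The intended class: word characters, dot, and hyphen (hyphen last, unescaped).
without_pipe = re.compile(r'[\w.-]+')

sample = "a|b.c-d"

# fullmatch succeeds only if the entire string fits the pattern.
pipe_match = with_pipe.fullmatch(sample)
no_pipe_match = without_pipe.fullmatch(sample)

print(pipe_match is not None)        # True: the pipe is accepted as a literal
print(no_pipe_match is not None)     # False: the pipe is not in the class
print(without_pipe.findall(sample))  # ['a', 'b.c-d']
```

So the original class was not wrong in the sense of rejecting valid input; it was simply more permissive than intended, letting `|` through as part of an address.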

Next, there are way too many variations on legitimate domain names to validate them all. Just look for at least one period surrounded by word characters after the @ symbol:

@\w+?\.\w+?\b
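Since the question asks about multi-domain addresses, it may help to compare the pattern above with a variant that allows several dotted labels. The `multi_dot` pattern below is a hypothetical extension, not part of the original answer, and it is only a rough sketch: `\w` does not cover hyphens in domain labels, and neither pattern is a full email validator. Python 3 is assumed.

```python
import re

# Pattern from this answer: stops after the first dotted pair following the @.
single_dot = r'\b[\w.-]+?@\w+?\.\w+?\b'
# Hypothetical variant: one or more "label." parts, non-capturing so that
# findall returns whole matches rather than group contents.
multi_dot = r'\b[\w.-]+@(?:\w+\.)+\w+\b'

text = "write to abc@def.ghi or xyz@mail.example.co.uk"

single_results = re.findall(single_dot, text)
multi_results = re.findall(multi_dot, text)

print(single_results)  # ['abc@def.ghi', 'xyz@mail.example']
print(multi_results)   # ['abc@def.ghi', 'xyz@mail.example.co.uk']
```

The single-dot pattern truncates the second address at the first word boundary after one dot, while the repeated non-capturing group keeps consuming labels until no further dot follows.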
