tl;dr
grep -P -i "[\w+-.]+@[\w+-]+\.[a-z]{2,}$" file.txt
-P - enables Perl-compatible regular expressions (needed for \w)
-i - ignore case (matches both @xyz.com and @xyz.COM)
For the input file file.txt:
example@gmail.com
john@smith.org
example@example.co.gr
john@zcal.co
john.smith@zcal.co
john.smith@zcal.com.br
john.de-smith@zcal.com.br
john.de-smith@hotmail.com
john_de_smith@hotmail.com
john.de-smith@hotmail.AG
john.de-smith@dept.hotmail.AG
Resulting:
example@gmail.com
john@smith.org
john@zcal.co
john.smith@zcal.co
john.de-smith@hotmail.com
john_de_smith@hotmail.com
john.de-smith@hotmail.AG
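To reproduce the result above, here is a minimal sketch that assumes GNU grep built with -P support. The temporary directory and the variable names workdir and matches are only illustrative; the pattern is the one from the tl;dr, written in single quotes so the shell leaves \ and $ alone.

```shell
# Recreate the sample input in a throwaway directory (illustrative location).
workdir=$(mktemp -d)
cat > "$workdir/file.txt" <<'EOF'
example@gmail.com
john@smith.org
example@example.co.gr
john@zcal.co
john.smith@zcal.co
john.smith@zcal.com.br
john.de-smith@zcal.com.br
john.de-smith@hotmail.com
john_de_smith@hotmail.com
john.de-smith@hotmail.AG
john.de-smith@dept.hotmail.AG
EOF

# Same pattern as in the tl;dr; -P enables \w, -i ignores case.
matches=$(grep -P -i '[\w+-.]+@[\w+-]+\.[a-z]{2,}$' "$workdir/file.txt")
printf '%s\n' "$matches"

# Clean up the throwaway directory.
rm -rf "$workdir"
```

The four addresses that do not show up are the ones with more than one dot after the @.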
No fancy characters, please.
In order to answer your question it's important to make some assumptions.
- E-mail regexes are tricky, and you have already read this answer on Stack Overflow (1), as well as this article on Wikipedia (2).
- The local part (a.k.a. user name) of your e-mail addresses contains only the following characters: letters A-Za-z, digits 0 to 9, the special characters +-_ (a very reduced subset of the allowed set), and a dot . in the middle.
- No fancy utf-8 or utf-16 characters. Not even Latin ones (e.g. ç, ñ).
This assumption represents 99.73% of all e-mail addresses known so far.
Allowed chars
username_allowed_chars = [A-Za-z0-9_+-.]
In fact, I assume you're using GNU grep, so you can use grep -P (Perl-style regex) and the shorthand \w, which is equivalent to [A-Za-z0-9_]; hence:
username_allowed_chars = [\w+-.]
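One detail that may be worth checking, offered as a sketch rather than as a correction: inside a bracket expression, +-. is read as a range from + to ., and in ASCII that range also contains the comma. Putting the hyphen last, as in [\w+.-], is one common way to keep it literal. The test characters and variable names below are made up for illustration, and GNU grep with -P is assumed.

```shell
# Candidate characters, one per line (illustrative input).
chars=$(printf '%s\n' a 7 _ + , - .)

# [\w+-.] : "+-." is the ASCII range 0x2B..0x2E, i.e. + , - .
with_range=$(printf '%s\n' "$chars" | grep -P '^[\w+-.]$' | tr -d '\n')

# [\w+.-] : a trailing hyphen is literal, so the comma is no longer accepted.
hyphen_last=$(printf '%s\n' "$chars" | grep -P '^[\w+.-]$' | tr -d '\n')

echo "range form:       $with_range"
echo "hyphen-last form: $hyphen_last"
```

For the sample addresses both forms behave the same, since none of them contains a comma.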
As for the domain part, remove + and the dot ., hence:
domain_allowed_chars = [\w-]
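Finally, we use + for one or more repetitions of the characters in a set. Note that the command below still carries a + inside the domain set ([\w+-] rather than [\w-]); this makes no difference for the sample input.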
grep -P -i "[\w+-.]+@[\w+-]+\.[a-z]{2,}$" file.txt
I'll break this regex into parts. First, the character set \w, which is used extensively.
\w - Translates to [A-Za-z0-9_], the word identifier, a.k.a. the allowed chars for variable names in programming parlance. In practice it disallows punctuation and other unusual characters in the e-mail user name;
\. - a literal dot .;
[\w+-.]+ - One or more of these identifiers, including the period or dot in user names, e.g. john.doe@gmail.com.
@ - a literal @ to separate the user name from the domain name.
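[a-z]{2,}$ - No fewer than two letters up to the end of the line (marked by $). The set is written in lowercase, but -i makes it match uppercase as well. Since the domain set contains no dot, addresses with more than one dot after the @ (e.g. example@example.co.gr) are not matched.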
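If addresses with several domain labels should be accepted too, the pieces above can be recombined. The following is a hypothetical variant, not the command from this answer: it anchors both ends of the line, puts the hyphen last in the local-part set, and allows one or more dot-terminated labels before the final one. The sample lines and the variable names pattern, sample and accepted are invented for illustration; GNU grep with -P is assumed.

```shell
# Hypothetical variant, not the command from the answer:
#   ^ ... $        the whole line must be an address
#   [\w+.-]+       local part, hyphen last so it stays literal
#   ([\w-]+\.)+    one or more dot-terminated domain labels
#   [a-z]{2,}      final label, letters only (any case because of -i)
pattern='^[\w+.-]+@([\w-]+\.)+[a-z]{2,}$'

# Illustrative input: three addresses and two lines that should be rejected.
sample=$(printf '%s\n' \
  'john@zcal.co' \
  'example@example.co.gr' \
  'john.de-smith@dept.hotmail.AG' \
  'not-an-address' \
  'john@localhost')

accepted=$(printf '%s\n' "$sample" | grep -P -i "$pattern")
printf '%s\n' "$accepted"
```

This keeps the same simplifying assumptions about the local part; it only relaxes the single-dot restriction on the domain.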
References
(1) Stack Overflow
(2) Wikipedia