Grep Regex to match emails with single top level domains

Question

I am writing a bash script that matches emails using a regex, but I only want to match emails with a single top-level domain, NOT emails with multiple ones.

For example, these emails should match:

example@example.com
example@gmail.com
john@smith.org

But this email should NOT match because it has two top-level domains, .co.fr:

example@example.co.fr

I tried the following:

grep -E -o '[A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}(?!\.[A-Za-z])' log.txt > mails.txt

But the (?!\.[A-Za-z]) part is not working in bash. My understanding is that it rejects the match if it finds a second domain after the first dot.

It works fine when I try it in online tools: https://regex101.com/r/H4ftC3/1

I also tried using $ at the end: [A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}$ but this one doesn't match anything.

How can I match only single top level domains?

Thanks

Note local part (to left of `@`) can be much more complicated than letters, digits and period. — jhnc, Mar 12 '23 at 21:58
This should work: `^[^@]+@[^.]+\.[^.]+$` https://regex101.com/r/IrTwy6/1 — Jerry Jeremiah, Mar 13 '23 at 02:00
See http://www.regular-expressions.info/email.html for a regexp that matches emails. You may need to tweak it based on the tool you're using for that matching. I personally use `(addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/) && (addr ~ /^.*[a-zA-Z]{2}.*@.*[a-zA-Z]{2}.*\.[a-zA-Z]{2,}$/)` in awk with the second check just to get rid of strings like `x@y.co` that are technically valid email addresses but are more likely just noise in the input. It could probably be done in one regexp but I was lazy. — Ed Morton, Mar 13 '23 at 13:53

Jayr Magave · Accepted Answer · 2023-03-13T03:51:13.073

tl;dr

grep -P -i '[\w.+-]+@[\w-]+\.[a-z]{2,}$' file.txt

-P - Perl-compatible regular expressions (PCRE); needed here so that \w works inside a bracket expression
-i - ignore case (matches @xyz.com or @xyz.COM)

For input file: file.txt

    example@gmail.com
    john@smith.org
    example@example.co.gr
    john@zcal.co
    john.smith@zcal.co
    john.smith@zcal.com.br
    john.de-smith@zcal.com.br
    john.de-smith@hotmail.com
    john_de_smith@hotmail.com
    john.de-smith@hotmail.AG
    john.de-smith@dept.hotmail.AG

Resulting:

    example@gmail.com
    john@smith.org
    john@zcal.co
    john.smith@zcal.co
    john.de-smith@hotmail.com
    john_de_smith@hotmail.com
    john.de-smith@hotmail.AG

No fancy characters, please.

To answer your question, I have to make some assumptions:

E-mail regexes are tricky, and you have already read this answer on Stack Overflow (1), as well as this article on Wikipedia (2).
The local part of your e-mails (a.k.a. user name) contains only the following characters: letters A-Za-z, digits 0 to 9, the special characters +-_ (a very reduced subset of the allowed set), and a dot . in the middle.
No fancy UTF-8 or UTF-16 characters, not even accented Latin ones (e.g. ç, ñ).

These assumptions cover 99.73% of all e-mail addresses known so far.

Allowed chars

username_allowed_chars = [A-Za-z0-9_.+-]

The hyphen goes last so that it is taken literally rather than as a range. I assume you're using GNU grep, so you can use grep -P (Perl-style regex) and the class \w, which is equivalent to [A-Za-z0-9_]. Hence:

username_allowed_chars = [\w.+-]

For the domain part, remove + and the dot ., hence:

domain_allowed_chars = [\w-]
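
To see what the position of the hyphen does inside a set: between two characters it denotes a range, so `+-.` covers `+`, `,`, `-` and `.`, and a comma slips in. A minimal sketch, assuming GNU grep built with `-P` support; the helper name `count_matches` and the two sample strings are made up for illustration:

```shell
# Hypothetical helper: count how many sample lines consist only of
# characters from the set passed as "$1".
count_matches() {
  printf '%s\n' 'john.doe' 'john,doe' | grep -cP "^${1}+\$" || true
}

range_form=$(count_matches '[\w+-.]')    # hyphen between + and . forms a range
literal_form=$(count_matches '[\w.+-]')  # hyphen last: a literal hyphen

echo "range form matches:   $range_form"
echo "literal form matches: $literal_form"
```

The range form accepts both sample lines (including `john,doe`), while the literal form accepts only `john.doe`.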

Finally, we use + for one or more repetitions of each set:

grep -P -i '[\w.+-]+@[\w-]+\.[a-z]{2,}$' file.txt

I'll break this regex into parts, starting with the character class \w, which is used extensively.

\w - Translates to [A-Za-z0-9_], the word characters, a.k.a. the characters allowed in variable names in programming parlance. In practice it disallows punctuation and other unusual characters in the e-mail user name;
\. - literal dot .;
[\w.+-]+ - One or more of these characters, including the period or dot in user names, e.g. john.doe@gmail.com;
@ - literal @ that separates the user name from the domain name;
[a-z]{2,}$ - At least two letters (upper or lower case, because of -i) up to the end of the line (marked by $).
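
The `$` anchor only helps when each address ends its line. If the addresses sit inside longer log lines, as the `-o` in the question suggests, a negative lookahead under `-P` is one alternative. The following is a sketch, assuming GNU grep with PCRE support; the sample log lines and the `sample_log` helper are invented for illustration. The lookahead also rejects a following letter, so the engine cannot backtrack from `com` to `co` and report a truncated address:

```shell
# Invented sample input: addresses embedded in log lines.
sample_log() {
  printf '%s\n' \
    'login ok for example@example.com from 10.0.0.1' \
    'bounce: example@example.co.fr (mailbox full)' \
    'bounce: example@example.com.br (mailbox full)' \
    'contact john@smith.org.'
}

# (?![A-Za-z]|\.[A-Za-z]) : the TLD must not be followed by another letter,
# nor by a dot that starts a further domain label.
single_tld=$(sample_log |
  grep -oP '[A-Za-z0-9._+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}(?![A-Za-z]|\.[A-Za-z])')

printf '%s\n' "$single_tld"
```

With this sample input, only `example@example.com` and `john@smith.org` are printed; the sentence-ending period after `john@smith.org` is not mistaken for a second domain level.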

References

(1) Stack Overflow

(2) Wikipedia

Not RFC compliant. The left part of `@` can be far more complicated — Gilles Quénot, Mar 13 '23 at 10:25

Gilles Quénot · Answer 2 · 2023-03-13T10:28:24.460

Input file

$ cat file
example@example.com
example@gmail.com
john@smith.org
example@example.co.fr

With `grep`:

$ grep -v '@.*\..*\.' file
example@example.com
example@gmail.com
john@smith.org

-v inverts the match: only lines that do not match the pattern are printed.

The regular expression matches as follows:

Regex	Description
`@`	Match the character “@” literally
`.*`	Match any character except newline, between zero and unlimited times, as many times as possible, giving back as needed (greedy)
`\.`	Match the character “.” literally
`.*`	Same as above
`\.`	Match the character “.” literally
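
Because `-v` prints every line that does not match, lines that contain no address at all also pass through. A sketch, assuming that is unwanted, which first keeps only lines with an `@` and then drops those with two dots after it (the sample lines and the `sample_lines` helper are invented for illustration):

```shell
# Invented sample input, including one line without any address.
sample_lines() {
  printf '%s\n' \
    'example@example.com' \
    'not an address' \
    'example@example.co.fr'
}

# Inverted match alone: the non-address line survives.
inverted_only=$(sample_lines | grep -v '@.*\..*\.')

# Keep lines that contain '@', then discard those with two dots after it.
filtered=$(sample_lines | grep '@' | grep -v '@.*\..*\.')

printf 'inverted only:\n%s\n\nfiltered:\n%s\n' "$inverted_only" "$filtered"
```

With this input the first variant keeps `not an address` alongside `example@example.com`, while the second keeps only `example@example.com`.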

edited Mar 13 '23 at 10:28

answered Mar 12 '23 at 21:53

Gilles Quénot

I think you need `LANG=C grep ...` to ensure `[\w-]` has that meaning – jhnc Mar 12 '23 at 22:00
Why ? UTF8 should be allowed AFAIK – Gilles Quénot Mar 12 '23 at 22:01
There's so many characters allowed in [`RFC 2822`](https://www.rfc-editor.org/rfc/rfc2822) – Gilles Quénot Mar 12 '23 at 22:18
my comment related to the description you provided, not to what characters are allowed in a domain name (but see https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1 and https://www.rfc-editor.org/rfc/rfc5321#section-2.3.5 ) – jhnc Mar 12 '23 at 22:31
Nitpicking here =) Added better explanation – Gilles Quénot Mar 12 '23 at 22:43

score 0 · Answer 3 · answered Mar 12 '23 at 23:02

Using grep

$ grep -E '^\S+@\w+\.\w+$' input_file
example@example.com
example@gmail.com
john@smith.org
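
Note that `\S` and `\w` are GNU extensions in `grep -E`. For a grep that lacks them, a roughly equivalent sketch with POSIX character classes, where `[^[:space:]]` stands in for `\S` and `[[:alnum:]_]` for `\w` (the sample lines and the `sample_lines` helper are invented for illustration):

```shell
# Invented sample input.
sample_lines() {
  printf '%s\n' \
    'example@example.com' \
    'example@example.co.fr' \
    'john@smith.org'
}

# Whole-line match: non-space local part, '@', then exactly two
# word-character labels separated by a single dot.
portable=$(sample_lines |
  grep -E '^[^[:space:]]+@[[:alnum:]_]+\.[[:alnum:]_]+$')

printf '%s\n' "$portable"
```

As with the original pattern, domain labels containing a hyphen are not matched, because neither `\w` nor `[[:alnum:]_]` includes `-`.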

answered Mar 12 '23 at 23:02

HatLess

Not RFC compliant. The left part of `@` can be far more complicated – Gilles Quénot Mar 13 '23 at 10:25
@GillesQuénot If it is more complicated than `\S+` can handle, then OP can show it in question so it can be addressed. – HatLess Mar 13 '23 at 11:09
@GillesQuénot In fact my answer is not RFC compliant. That's why I assumed a reduced set of characters, only \w eq. [A-Za-z0-9_] and [.+-], instead of [A-Za-z0-9_.!#$%&'*+-/=?^_`{|}~], that are RFC compliant but very rare in practice. Sometimes pi = 3.1416 instead of pi = π , for that matter. – Jayr Magave Mar 14 '23 at 21:40