0

I am writing a bash script that matches emails using a regex, but I only want to match emails with a single top-level domain, NOT emails with multiple ones.

For example, these emails should match:

example@example.com
example@gmail.com
john@smith.org

But this email should NOT match because it has two top-level domains (.co.fr):

example@example.co.fr

I tried the following:

grep -E -o '[A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}(?!\.[A-Za-z])' log.txt > mails.txt

But the (?!\.[A-Za-z]) part is not working in bash. My understanding is that it rejects the match if it finds a second domain after the first dot.

It works fine when I try it on online tools: https://regex101.com/r/H4ftC3/1

I also tried using $ at the end: [A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}$ but this one doesn't match anything.

How can I match only single top level domains?

Thanks

Y2theZ
  • Note local part (to left of `@`) can be much more complicated than letters, digits and period. – jhnc Mar 12 '23 at 21:58
  • This should work: `^[^@]+@[^.]+\.[^.]+$` https://regex101.com/r/IrTwy6/1 – Jerry Jeremiah Mar 13 '23 at 02:00
  • See http://www.regular-expressions.info/email.html for a regexp that matches emails. You may need to tweak it based on the tool you're using for that matching. I personally use `(addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/) && (addr ~ /^.*[a-zA-Z]{2}.*@.*[a-zA-Z]{2}.*\.[a-zA-Z]{2,}$/)` in awk with the second check just to get rid of strings like `x@y.co` that are technically valid email addresses but are more likely just noise in the input. It could probably be done in one regexp but I was lazy. – Ed Morton Mar 13 '23 at 13:53

3 Answers

1

tl;dr

grep -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}$' file.txt

-P - Perl-compatible regex (PCRE), which supports \w inside bracket expressions
-i - ignore case (matches @xyz.com or @xyz.COM)

For input file: file.txt

    example@gmail.com
    john@smith.org
    example@example.co.gr
    john@zcal.co
    john.smith@zcal.co
    john.smith@zcal.com.br
    john.de-smith@zcal.com.br
    john.de-smith@hotmail.com
    john_de_smith@hotmail.com
    john.de-smith@hotmail.AG
    john.de-smith@dept.hotmail.AG

Resulting:

    example@gmail.com
    john@smith.org
    john@zcal.co
    john.smith@zcal.co
    john.de-smith@hotmail.com
    john_de_smith@hotmail.com
    john.de-smith@hotmail.AG
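The command above expects each address to end its line. The question instead pulls addresses out of log lines with -o, where an address can sit in the middle of a line. A minimal sketch for that case, assuming GNU grep built with PCRE support; the function name `extract_single_tld` and the sample log lines are made up for illustration, and the lookahead is a tightened variant of the one in the question rather than anything from this answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each address found anywhere in the input lines,
# skipping addresses whose domain continues with another dot-separated label.
extract_single_tld() {
    # (?![A-Za-z0-9-]|\.[A-Za-z]) rejects a match that stops in the middle of
    # a label or that is followed by ".<letter>", i.e. a further domain part.
    grep -oP '[A-Za-z0-9._+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}(?![A-Za-z0-9-]|\.[A-Za-z])'
}

# Invented sample lines standing in for log.txt.
printf '%s\n' \
    'login ok for example@gmail.com from 10.0.0.1' \
    'bounce: example@example.co.fr rejected' \
    'sent to john@smith.org.' | extract_single_tld
```

The first alternative in the lookahead matters: with only `\.[A-Za-z]`, the engine could backtrack inside `example@example.com.fr` and report the truncated `example@example.co`.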

No fancy characters, please.

In order to answer your question it's important to make some assumptions.

  1. E-mail regexes are tricky, and you have already read this answer on Stack Overflow (1), as well as this article on Wikipedia (2).
  2. The local part of your e-mails (a.k.a. the user name) contains only the following characters: letters A-Z and a-z, digits 0 to 9, the special characters +, - and _ (a very reduced subset of the allowed set), and the dot . in the middle.
  3. No fancy UTF-8 or UTF-16 characters, not even Latin ones (e.g. ç, ñ).

These assumptions cover 99.73% of all e-mail addresses known so far.

Allowed chars

username_allowed_chars = [A-Za-z0-9_+.-]

I assume you're using GNU grep, so you can use grep -P (Perl-style regex) and the shorthand \w, which is equivalent to [A-Za-z0-9_]. The hyphen goes last so that it is taken literally rather than as the range from + to . (which would also admit a comma). Hence:

username_allowed_chars = [\w+.-]

For the domain part, remove + and the dot:

domain_allowed_chars = [\w-]

Finally, we use + for one or more repetitions of these characters:

grep -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}$' file.txt

I'll break this regex into parts, starting with the character set \w, which is used extensively.

  • \w - Translates to [A-Za-z0-9_], the "word" characters, a.k.a. the characters allowed in variable names in programming parlance. In practice this disallows punctuation and other unusual characters in the e-mail user name;
  • \. - literal dot .;
  • [\w+.-]+ - One or more of these characters, including the period (dot) in user names, e.g. john.doe@gmail.com;
  • @ - literal @ separating the user name from the domain name;
  • [\w-]+ - One or more word characters or hyphens, i.e. a single domain label with no dot in it;
  • [a-z]{2,}$ - At least two letters up to the end of the line (marked by $); with -i, uppercase letters match too.
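Where grep has no -P (some BSD or busybox builds, for instance), the same idea can be approximated in POSIX ERE by spelling out the character classes. This is only a sketch under a few assumptions: one address per line, a function name (`single_tld_only`) invented for illustration, and a leading ^ that the original command does not have, added so the whole line must be an address:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read addresses on stdin, one per line, and print only
# those whose domain is exactly "name.tld".
single_tld_only() {
    # [[:alnum:]_] stands in for \w; the literal "-" goes last in each
    # bracket expression so it is not read as a range.
    # LC_ALL=C keeps [[:alnum:]] limited to ASCII letters and digits.
    LC_ALL=C grep -E -i '^[[:alnum:]_+.-]+@[[:alnum:]_-]+\.[a-z]{2,}$'
}

printf '%s\n' \
    'example@gmail.com' \
    'john.de-smith@hotmail.AG' \
    'example@example.co.gr' \
    'john.de-smith@dept.hotmail.AG' | single_tld_only
```

With the four sample lines, only the first two addresses should be printed.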

References

(1) Stack Overflow

(2) Wikipedia

0

Input file

$ cat file
example@example.com
example@gmail.com
john@smith.org
example@example.co.fr

With grep:

$ grep -v '@.*\..*\.' file
example@example.com
example@gmail.com
john@smith.org
  • -v inverts the match: lines that match the pattern are dropped

The regular expression matches as follows:

Regex  Description
@      Match the character “@” literally
.*     Match any character except newline, between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\.     Match the character “.” literally
.*     Same as the previous .*
\.     Match the character “.” literally
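Because -v prints every line that does not match, lines with no address at all pass through as well. If the input can contain such lines, one possible variation is to keep only lines containing @ first and then apply the inverted match. This is a sketch, not part of the original command, and the function name `drop_multi_dot` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep lines that contain "@" but do not have two dots
# anywhere after it.
drop_multi_dot() {
    grep '@' | grep -v '@.*\..*\.'
}

printf '%s\n' \
    'example@example.com' \
    'no address on this line' \
    'example@example.co.fr' | drop_multi_dot
```

Only `example@example.com` should survive both filters here.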
Gilles Quénot
0

Using grep

$ grep -E '^\S+@\w+\.\w+$' input_file
example@example.com
example@gmail.com
john@smith.org

View demo
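The question mentions that a pattern ending in $ matched nothing. One possible cause, and this is only a guess about the asker's log.txt, is Windows (CRLF) line endings: the carriage return sits between the last letter and the end of the line, so an anchored pattern fails. A minimal sketch under that assumption, with `match_after_crlf_fix` as an invented name; note that \S and \w with grep -E are GNU extensions:

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete carriage returns, then apply the anchored
# pattern from the command above.
match_after_crlf_fix() {
    tr -d '\r' | grep -E '^\S+@\w+\.\w+$'
}

# Simulate a file saved with CRLF line endings.
printf 'example@example.com\r\njohn@smith.org\r\nexample@example.co.fr\r\n' |
    match_after_crlf_fix
```

If the addresses are instead embedded in longer log lines, the anchors ^ and $ will still reject them, and stripping carriage returns alone will not help.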

HatLess
  • Not RFC compliant. The left part of `@` can be far more complicated – Gilles Quénot Mar 13 '23 at 10:25
  • @GillesQuénot If it is more complicated than `\S+` can handle, then OP can show it in question so it can be addressed. – HatLess Mar 13 '23 at 11:09
  • @GillesQuénot In fact my answer is not RFC compliant. That's why I assumed a reduced set of characters, only \w eq. [A-Za-z0-9_] and [.+-], instead of [A-Za-z0-9_.!#$%&'*+-/=?^_`{|}~], that are RFC compliant but very rare in practice. Sometimes pi = 3.1416 instead of pi = π , for that matter. – Jayr Magave Mar 14 '23 at 21:40