19

How do I properly construct a regular expression for the Linux "grep" program to find all email addresses in, say, the /etc directory? Currently, my script is the following:

grep -srhw "[[:alnum:]]*@[[:alnum:]]*" /etc

It works OK: I see some of the emails. But when I modify it to require one or more characters before and after the "@" sign ...

grep -srhw "[[:alnum:]]+@[[:alnum:]]+" /etc

.. it stops working altogether.

Also, it doesn't catch emails of the form "Name.LastName@site.com".

Help!

AntonAL
  • 3
    Here's a better regex to match e-mail addresses, although it requires Perl: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html – Thomas May 24 '10 at 16:30
  • 12
    If you're not using `(?:[a-z0-9!#$%&'*+/=?^_``{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_``{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])` you're doing it wrong. http://www.regular-expressions.info/email.html – corsiKa May 24 '10 at 16:31
  • 1
    @thomas that's just ridiculous!!! :D @glowcoder yours is bad enough but... that's the most convoluted regexp i've seen in 10+ years of using them :D – chris May 24 '10 at 16:36
  • Both @thomas and @glowcoder have stumbled onto the sad truth that email addresses are really complex. A lot more complex than most people realize. However, most email addresses are rather simple ;) – D.Shawley May 24 '10 at 16:41
  • I should have put a smiley in my remark; it was not intended to be taken seriously. There is no need for such stringent validation. You're not going to know whether the address is valid until you actually send e-mail to it... – Thomas May 24 '10 at 16:46

9 Answers

27

Here is another example

grep -Eiorh '([[:alnum:]_.-]+@[[:alnum:]_.-]+?\.[[:alpha:].]{2,6})' "$@" * | sort | uniq > emails.txt

This variant also works with third-level domains.
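As a rough sketch of the same idea on a small sample: the helper name `extract_unique_emails` and the sample addresses are made up for illustration. The `+?` is dropped here on the assumption that it is not needed, since POSIX extended regular expressions have no lazy quantifiers, and `sort -u` stands in for `sort | uniq`.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, print each distinct address once.
extract_unique_emails() {
    grep -Eio '[[:alnum:]_.-]+@[[:alnum:]_.-]+\.[[:alpha:].]{2,6}' | sort -u
}

# Made-up input with a duplicated third-level-domain address.
printf 'a: john.doe@mail.example.co.uk\nb: john.doe@mail.example.co.uk\nc: x@y.org\n' |
    extract_unique_emails
```

The duplicated address should be printed only once, with the whole `mail.example.co.uk` domain kept.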

mosg
  • 1
    This one worked REALLY well for me. Include `uniq -c` to get a count of all the email addresses! Sweet! – Jess Jun 26 '13 at 17:22
  • 5
    This is a bad answer. It converts `fred+smith@company.com` to `smith@company.com`, dropping the `fred+`. The use of plus signs is very common with Gmail. Many other special characters are also allowed in the official email address spec (RFC 5322). – Chris Johnson Sep 19 '15 at 18:32
  • 2
    +1 , good pointer. Here is a slight variant "grep -Eiorh '([[:alnum:]|\._.-]+@[[:alnum:]_.-]+?\.[[:alpha:].]{2,6})' "$@" * | sort | uniq > emails.txt" which also considers '.' in the email – parolkar Apr 11 '17 at 06:56
  • This is good but all of these match: `fuU@jRpfhW.pI` `t@K6B.zzz` `A1uSO@H.qVBDV` `5@ebUQai3._a.LI` – andres.gtz Nov 07 '22 at 21:37
6

By default, grep uses basic regular expressions, where several special characters, including +, must be escaped with a backslash to act as operators. You'll want to do one of these two:

grep -srhw "[[:alnum:]]\+@[[:alnum:]]\+" /etc

egrep -srhw "[[:alnum:]]+@[[:alnum:]]+" /etc
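A small sketch of the difference, assuming GNU grep; the sample line and its address are made up for illustration, and `-o` is used only to print the matched part.

```shell
#!/bin/sh
# Made-up sample input; the address is a placeholder.
line='contact: admin@example'

# Basic regex (plain grep): a bare + is a literal plus sign, so nothing matches.
printf '%s\n' "$line" | grep -o '[[:alnum:]]+@[[:alnum:]]+' || echo 'bare + in basic regex: no match'

# Basic regex with \+ (a GNU extension): "one or more".
printf '%s\n' "$line" | grep -o '[[:alnum:]]\+@[[:alnum:]]\+'

# Extended regex (grep -E, the same as egrep): + works without escaping.
printf '%s\n' "$line" | grep -Eo '[[:alnum:]]+@[[:alnum:]]+'
```

The last two commands should both print the address, while the first finds nothing because the sample contains no literal plus sign.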
Cascabel
  • 2
    Sometimes, logins have underscores, so I'd add an underscore to the expression: "[[:alnum:]|_]\+@[[:alnum:]]\+" – Edmond Meinfelder May 10 '13 at 19:58
  • This is a bad answer. It includes many false positives. For example, `x@x@x@x@x` passes your regex. – Chris Johnson Sep 19 '15 at 18:26
  • @ChrisJohnson This was more about explaining how to fix what the OP had (bad escaping) than providing a perfect regex. Certainly you could reduce the false positives if this isn't just for something quick and dirty, but at that point you pretty much always want to use something other than regex. And if you are doing something quick and dirty, usually better to have false positives than false negatives. So... perfect? Definitely not? Bad? Not really. – Cascabel Sep 27 '15 at 16:59
5

I modified your regex to include punctuation (like .-_ etc) by changing it to

egrep -ho "[[:graph:]]+@[[:graph:]]+"

This is still pretty clean and matches... well, almost anything with an @ in it, of course. It also matches third-level domains and addresses with '%' or '+' in them. See http://www.delorie.com/gnu/docs/grep/grep_8.html for good documentation of the character classes used.

In my example, the addresses were surrounded by white space, making matching quite easy. If you grep through a mail server log for example, you can add < > to make it match only the addresses:

egrep -ho "<[[:graph:]]+@[[:graph:]]+>"

@thomas, @glowcoder and @oedo are all right. The RFC that defines what an email address can look like is quite a fun read. (I used GNU grep 2.9 above, as included in Ubuntu.)

Also check out zpea's version below, it should make for a less trigger-happy matcher.
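To illustrate how trigger-happy `[[:graph:]]` is, here is a small sketch on a made-up line. The narrower character class in the second command is only an assumption along the lines of what the comments suggest, not a complete email pattern.

```shell
#!/bin/sh
# Made-up sample line; the address is a placeholder.
sample='see <bob@example.com>, ok'

# [[:graph:]] also matches punctuation, so the angle brackets and comma are kept.
printf '%s\n' "$sample" | grep -Eo '[[:graph:]]+@[[:graph:]]+'

# A narrower class (an assumption, for comparison) leaves only the address.
printf '%s\n' "$sample" | grep -Eo '[[:alnum:]._%+-]+@[[:alnum:].-]+'
```

The first command should print the address together with the surrounding `<`, `>` and `,`; the second should print just `bob@example.com`.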

Florian Sesser
  • 2
    Good answer! However, it's probably better to use something like `[[:alnum:]._%+-]` instead of `[[:graph:]]` because `[:punct:]` (which is included in `[:graph:]`) contains the `@` character - which might lead to problems in matching - alongside some other characters unlikely to be found in email addresses. – zpea Aug 16 '12 at 23:51
  • 1
    This is a bad answer. It includes many false positives. For example, `x@x@x@x@x` passes your regex. – Chris Johnson Sep 19 '15 at 18:34
  • Chris is right, I wouldn't use this myself for serious tasks, like sanitizing user input. However, matching all *possible* (RFC-compliant) eMail addresses is hard, as others have pointed out. For the OP's question, Chris's `x@x@x@..` example might well be irrelevant. As always with regex, choose wisely. – Florian Sesser Sep 24 '15 at 08:35
4

I have used this one to extract email addresses, identified by the 'at' symbol and delimited by whitespace, from a text:

egrep -o "[^[:space:]]+@[^[:space:]]+" | tr -d "<>"

Of course, you can use grep -E instead of egrep (extended grep). Note that the tr command is used to remove typical email delimiters.
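A minimal sketch of that pipeline on made-up, mail-header-style input; the function name `extract_addresses` and the addresses are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, print whitespace-delimited tokens
# that contain an @, with any angle brackets stripped by tr.
extract_addresses() {
    grep -Eo '[^[:space:]]+@[^[:space:]]+' | tr -d '<>'
}

# Made-up input: one bracketed address, one bare address.
printf 'From: <alice@example.org>\nTo: bob@example.net\n' | extract_addresses
```

Both addresses should come out without brackets, one per line.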

caligari
3

grep -E -o -r "[A-Za-z0-9][A-Za-z0-9._%+-]+@[A-Za-z0-9][A-Za-z0-9.-]+\.[A-Za-z]{2,6}" /etc

This is adapted from an answer that is not mine originally, but I found it super helpful. It's from here:

http://www.shellhacks.com/en/RegEx-Find-Email-Addresses-in-a-File-using-Grep

They suggest:

grep -E -o -r "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" /etc

But it has certain false positives, like '+person..@example.com' or 'person@..com', and the whitespace constraints miss things like "mailto:person@example.com" (not technically an email but contains one); so I tweaked it a little bit.

(Do what you want with the options to grep, I don't know them very well)
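As a quick check of one of those cases, this sketch runs both patterns against 'person@..com' and against an ordinary address. The variable names are made up for illustration, and `\b` is assumed to be supported, as it is in GNU grep.

```shell
#!/bin/sh
# The suggested pattern and the tweaked pattern from this answer.
loose='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
strict='[A-Za-z0-9][A-Za-z0-9._%+-]+@[A-Za-z0-9][A-Za-z0-9.-]+\.[A-Za-z]{2,6}'

sample='person@..com'

# The suggested pattern accepts the malformed domain.
printf '%s\n' "$sample" | grep -Eo "$loose"

# The tweaked pattern needs a letter or digit right after the @, so it rejects it.
printf '%s\n' "$sample" | grep -Eo "$strict" || echo 'tweaked pattern: no match'

# An ordinary address still matches the tweaked pattern.
printf '%s\n' 'person@example.com' | grep -Eo "$strict"
```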

Tyler
1

This recursive one works great for me:

grep -rIhEo "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" /etc/*
Oli
  • So you don't want to find my email address *@example.com then? (Yes * is a valid character) - and no I'm not at example.com, I changed that part :P – jcoder Aug 17 '12 at 16:51
  • you're right, Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~ are allowed : http://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-email-address , -1 ! :) – Oli Aug 18 '12 at 21:24
0

Just wanted to mention that a slight variation of this works great for grabbing mentions from things like twitter tweets:

grep -Eiorh '(@[[:alnum:]_.-]+)' "$@" * | sort | uniq -c

PHY6
  • What's with the use of both `"$@"` and `*`? You want one or the other, but not both. Anyway, this answer doesn't really belong here -- maybe post a new question instead? ISTR there are a few more twists to Twitter handles. – tripleee Mar 24 '15 at 22:34
0

This seems to work, but it also picks up file names containing @:

egrep -osrwh "[[:alnum:]._%+-]+@[[:alnum:]]+\.[a-zA-Z]{2,6}" ~/.thunderbird/
PaSe
0

I bet there is no better base regex than this one:

egrep -o "[a-zA-Z0-9_.+%-]+@[a-zA-Z0-9_.+%-]+\.[a-zA-Z0-9_.+%-]+"

It will not miss a single email in the garbage, but you will have to weed out things that look like emails but are not, such as home_mobile@1x.png. That needs either a manual check or a more specific version of the regex above, with more special characters added as needed. Still, I know of no base regex that is better than this one.
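One possible way to automate part of that cleanup is a second grep that drops matches ending in image file extensions. This is only a sketch: the function name, the extension list and the sample input are assumptions, and the pattern is written without the optional backslashes.

```shell
#!/bin/sh
# Hypothetical helper: extract address-like tokens from stdin, then drop
# those that end in a common image extension (the list is an assumption).
extract_non_image_addresses() {
    grep -Eo '[a-zA-Z0-9_.+%-]+@[a-zA-Z0-9_.+%-]+\.[a-zA-Z0-9_.+%-]+' |
        grep -Eiv '\.(png|jpe?g|gif|svg)$'
}

# Made-up input: one image file name and one real-looking address.
printf 'icon: home_mobile@1x.png\nmail: jane@example.com\n' | extract_non_image_addresses
```

Only `jane@example.com` should remain; anything the extension list does not cover still needs a manual look.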