I would like to validate email addresses found in text files in a directory using bash.

My regex:

grep -Eoh \
         "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,8}\b" * \
         | sort -u > mail_list

This regex satisfies all my requirements, but it cannot exclude addresses such as:

^%&blah@gmail.com

and

with.dot@sale..department.company-name.com

(with two or more consecutive dots).

These kinds of addresses should be excluded.

How can I modify this regex to exclude these types of emails?
I can use only one expression for this task.

savanto
AlexG
    A good regex to check emails: http://stackoverflow.com/a/719543/1983854 – fedorqui Jun 05 '14 at 09:28
    Or, more modern (in terms of both regex features and address specification): http://stackoverflow.com/a/1917982/1030675 – choroba Jun 05 '14 at 09:36
    As mentioned in the link above, regular expressions aren't really the way to go. I would suggest using something like [Email::Valid](http://search.cpan.org/~rjbs/Email-Valid-1.194/lib/Email/Valid.pm) in Perl, or [`filter_var`](http://www.php.net/manual/en/function.filter-var.php) in PHP – Tom Fenech Jun 05 '14 at 09:36

2 Answers

The email address ^%&blah@gmail.com is actually valid: the characters ^, % and & are all permitted in the local part of an address under RFC 5322.

You can do this in Perl using the Email::Valid module (this assumes that each entry is on its own line):

perl -MEmail::Valid -ne 'print if Email::Valid->address($_)' file1 file2

file1

not email
abc@test.com

file2

not email
def@test.com
^%&blah@gmail.com
with.dot@sale..department.company-name.com

output

abc@test.com
def@test.com
^%&blah@gmail.com
Tom Fenech

Try this regex:

'\b[A-Za-z0-9][A-Za-z0-9._%+-]*@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

I added an alphanumeric class to the front, so that an address must begin with a letter or digit; symbols are allowed only after that.

After the @ sign, I added a group that matches one or more letters, digits, or hyphens followed by exactly one period. The group can be repeated, so it still matches long.domain.name.com, but because the period is no longer part of the character class, two consecutive dots cannot match.

Finally, the regex ends with the top-level domain as you had it, for example 'com'.
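
As a quick check of the behaviour described above, here is a minimal sketch that runs the pattern over a few made-up sample lines instead of real files. The sample addresses and the variable names are assumptions for illustration only, and the local part is written as one alphanumeric followed by zero or more further characters so that single-character local parts are accepted. It assumes GNU grep, since `\b` is a GNU extension in `-E` mode.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; these addresses are invented for the demo.
sample='abc@test.com
with.dot@sale..department.company-name.com
long.name@long.domain.name.com
^%&blah@gmail.com'

# Local part: one alphanumeric, then any of the allowed characters.
# Domain: one or more "label." groups, then a 2-8 letter final label.
regex='\b[A-Za-z0-9][A-Za-z0-9._%+-]*@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# -o prints only the matched text, sort -u removes duplicates.
result=$(printf '%s\n' "$sample" | grep -Eo "$regex" | sort -u)
printf '%s\n' "$result"
```

With GNU grep this should print abc@test.com, blah@gmail.com and long.name@long.domain.name.com. The double-dot address is dropped, but blah@gmail.com is still extracted from ^%&blah@gmail.com, because `\b` matches between `&` and `b`.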


Update:

Since \b matches a word boundary, and the symbols ^%& are not word characters, the above will still match blah@gmail.com even though it is preceded by undesired characters. To avoid this, use a negative lookbehind. This requires grep -P instead of -E:

grep -P '(?<![%&^])\b[A-Za-z0-9][A-Za-z0-9._%+-]*@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

The (?<![%&^]) tells the regex engine to continue matching only if the current position is not immediately preceded by one of the characters %, & or ^.
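
To see the effect of the lookbehind, here is a small sketch over made-up sample lines. It assumes a GNU grep built with PCRE support (the `-P` option); the sample data and variable names are hypothetical, and the local part is written as one alphanumeric followed by zero or more further characters.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; these addresses are invented for the demo.
sample='abc@test.com
^%&blah@gmail.com
with.dot@sale..department.company-name.com'

# (?<![%&^]) rejects a match that starts right after %, & or ^.
regex='(?<![%&^])\b[A-Za-z0-9][A-Za-z0-9._%+-]*@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# -P enables Perl-compatible regexes, -o prints only the matched text.
result=$(printf '%s\n' "$sample" | grep -Po "$regex" | sort -u)
printf '%s\n' "$result"
```

This should print only abc@test.com. A match cannot start at `b` because it follows `&`, and it cannot start later inside `blah` because there is no word boundary there. Note that the lookbehind only covers the three listed symbols; other leading characters, such as `#`, would need to be added to the class.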

Community
savanto
  • Thank you! But I still cannot exclude addresses such as ^%&blah@gmail.com. Now my output file contains blah@gmail.com, but my goal is to exclude such addresses from mail_list completely. For my purposes, neither ^%&blah@gmail.com nor the blah@gmail.com extracted from it is a valid address. – AlexG Jun 05 '14 at 22:44
  • The problem is that `\b` matches the edge of a word, and those symbols are not considered part of the word 'blah'. You could try a space or `\s` instead of the `\b`, which would match only email addresses that come after whitespace. – savanto Jun 05 '14 at 22:47
  • You can also try a [Negative Lookbehind](http://www.regular-expressions.info/lookaround.html#lookbehind) with `grep -P '(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'`. The `(?<![%&^])` tells the regex engine to continue matching only if the string is not preceded by one of the characters `%&^`. – savanto Jun 05 '14 at 22:55
  • I tried this (`\s`), but it excludes all the normal addresses I have. I use bash on Debian, but my colleague uses Ubuntu with `\s` (as you advise) and gets a satisfactory result. Maybe the problem is a difference between the shells... – AlexG Jun 05 '14 at 22:57
  • I'm using GNU `grep` v. 2.18, on `bash`. – savanto Jun 05 '14 at 22:59