
This is the regular expression I have formed so far:

/(?:("?(?:.*)"?)\s*)?\s<(.*@.*)>|(?:mailto:(.*@.*))|(.*@.*)/gi

You can check it out at regex101

I'm trying to extract 'Name' & 'Email' from the following:

John Smith <john.smith@gmail.com>
John Smith <johnsmith@gmail.com>
"John Smith" <johnsmith@gmail.com>
"John" <johnsmith@gmail.com>
John Smith<johnsmith@gmail.com>
<johnsmith@gmail.com>
johnsmith@gmail.com
mailto:johnsmith@gmail.com
"John"<johnsmith@gmail.com>

To: John Smith <john.smith@gmail.com>
From: John Smith <john.smith@gmail.com>
Reply-to: john.smith@gmail.com
Return-path: <john.smith@gmail.com>
Message-id: <john.smith@gmail.com>
References: <john.smith@gmail.com>
Original-recipient: rfc822;john.smith@gmail.com
for john.smith@gmail.com
ESMTPSA id <john.smith@gmail.com>
domain of john.smith@gmail.com
envelope-from=john.smith@gmail.com
(ORCPT john.smith@gmail.com)

Having started from scratch, I feel as if I'm almost there, but I'm having trouble with 3 things:

  • Stripping double quotes from the first capturing group

  • Dealing with the whitespace missing variant: John Smith<johnsmith@gmail.com>

  • False positives in the 'Name' field for the latter block, so I need a way of excluding these (perhaps using the preceding :, ;, =, for, id, of?)

As a complete regular expression novice, I would appreciate a little direction from someone knowledgeable on how I might overcome these issues.

For the curious, I've unfortunately lost my CardDAV and thus all contacts, so in true Linux fashion, I'm going to rebuild a list of emails by manually parsing my entire raw MBOX, sorting by most common, and go from there.

I will be using bash grep, or perl sed.

Thank you for your time!

Kier
  • bash and javascript? – hjpotter92 Nov 21 '15 at 10:25
  • I would like to use bash/grep as I currently am if at all possible. But as I understand it there are regex complexity limitations, so I would be happy to use javascript or perl instead if need be. – Kier Nov 21 '15 at 10:40
  • JS regex are more limited than bash. And bash provides `sed`, which uses Perl's regexes. – hjpotter92 Nov 21 '15 at 10:42
  • I wasn't aware of that, I had read that the opposite is true. I know about using GNU `grep` since it can use perl regex, but I hadn't thought of `sed`. Either way, I'm more comfortable with bash, so using `grep` or `sed` is perfect. – Kier Nov 21 '15 at 10:46
  • Since you're choosing the language, can it be done in PHP? – Mariano Nov 21 '15 at 12:17
  • From your examples, `envelope-from=john.smith@gmail.com` *could* be a valid email address as well; without context, it's impossible to tell. – tripleee Nov 21 '15 at 12:59
  • @hjpotter92 `sed` does *not* support Perl regular expressions in any popular/common version. – tripleee Nov 21 '15 at 13:01
  • @tripleee, I don't think this is a duplicate of [this](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address), as the Asker's goal is not to validate, but to extract (which requires less strict matching), and not only an email address, but also a name, while discarding other prefixed text. Please reconsider. – trincot Nov 22 '15 at 09:12
  • @tripleee, I concur. I did scour Stack Exchange for hours in search of an answer before posting my question. There are some related posts, but as trincot has mentioned, none that addressed a name as well as email while discarding prefixed text. – Kier Nov 22 '15 at 10:42
  • @trincot & OP I am very hesitant to reopen this, for a number of reasons. If you studied previous questions then you know there are a number of common shortcuts which fail on some addresses but simplify the task a lot, but you have not stated your requirements for these (e.g. quoted localpart, international domain names, etc). If you studied previous questions, you know that regex is less than ideal for this task. ... – tripleee Nov 22 '15 at 12:46
  • ... If your requirements for extraction are less demanding than for validation, one of the better validation regexes should be useful as a starting point (and frankly, just accepting Firstname Lastname in front would tackle most real-world cases, and you probably cannot achieve 100% accuracy anyway). The answers so far are not particularly striking, and in fact repeat many mistakes from the poorer answers to existing questions. And yet, you already accepted one of them. Do you seriously believe that you can get good new answers? Why? – tripleee Nov 22 '15 at 12:46
  • Please keep in mind that I am by no means the final authority here, though. If you feel that this has been handled incorrectly, by all means open a question on https://meta.stackoverflow.com/ or flag for moderator attention. – tripleee Nov 22 '15 at 13:14

2 Answers


Just a suggestion: it may suit you better to match the "before email" text and the "email" separately, and then, after extracting them, to handle the "before email" part in program logic. Like this:

((?:(?![a-z.]+@[a-z.]+\.[a-z]{2,4})(?:.|\r))+)([a-z.]+@[a-z.]+\.[a-z]{2,4})

((?:(?!regex)(?:.|\r))+)(regex) - this means: match any character (including a carriage return), one or more times, as long as regex does not match at that position, and store that run in the first backreference; then match regex itself and store it in the second backreference.

Edit: If you want to handle cases where the first group doesn't exist (only the email is present), here is a modified version.

((?:(?![a-z.]+@[a-z.]+\.[a-z]{2,4})(?:.|\r))*)([a-z.]+@[a-z.]+\.[a-z]{2,4})

It uses * instead of +, so the first group may be empty.

Edit2: Improvement following trincot's comment.

((?:(?![^><@\s=;]+@[^><@\s=;]+\.[a-z]{2,4})(?:.|\r))*)([^><@\s=;]+@[^><@\s=;]+\.[a-z]{2,4})
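As a rough illustration of the "regex first, program logic afterwards" idea, here is a minimal Python sketch built on the Edit2 pattern above. The function name `split_prefix_and_email` and the clean-up step (trimming spaces, double quotes and the opening angle bracket) are my own assumptions for the example, not part of the answer.

```python
import re

# Address sub-pattern taken from the Edit2 regex above.
ADDRESS = r"[^><@\s=;]+@[^><@\s=;]+\.[a-z]{2,4}"

# Group 1: everything before the address; group 2: the address itself.
PATTERN = re.compile(
    r"((?:(?!" + ADDRESS + r")(?:.|\r))*)(" + ADDRESS + r")",
    re.IGNORECASE,
)

def split_prefix_and_email(line):
    """Return (cleaned_prefix, email), or None when no address is found."""
    match = PATTERN.search(line)
    if match is None:
        return None
    prefix, email = match.groups()
    # Program-logic step (an assumption for this sketch): trim whitespace,
    # double quotes and the opening angle bracket from the prefix.
    return prefix.strip(' \t"<'), email

if __name__ == "__main__":
    for sample in ['"John Smith" <johnsmith@gmail.com>',
                   "John Smith<johnsmith@gmail.com>",
                   "johnsmith@gmail.com"]:
        print(split_prefix_and_email(sample))
```

Header prefixes such as `To:` are still part of the first group here, so they would need to be filtered out in the same program-logic step.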
Georgi Naumov
  • Thank you, I do appreciate this logic, I hadn't thought about doing it like that. The only thing it doesn't pick up is the 'johnsmith@gmail.com' variant. You can see it on [regex101](https://regex101.com/r/zB9dR6/1). Although handling the 'before' with program logic is possible, if I *could* squeeze this into an all-powerful regex, it would make things much simpler for me. I have years and years of email to parse, that's a lot of false positives... – Kier Nov 21 '15 at 10:54
  • I edited the answer to handle this case as well. – Georgi Naumov Nov 21 '15 at 11:17
  • email addresses can have more than just [a-z.] – trincot Nov 21 '15 at 11:33
  • Yes. Just bear that in mind. – Georgi Naumov Nov 21 '15 at 11:41
  • And many valid, current top-level domains have a lot more than four characters, many from a broader set than the overly strict `[a-z]`. – tripleee Nov 21 '15 at 13:03
0

Here is another possible regex, which I have split across three lines for clarity, but which should be on one line:

\s*(?:.*?[:=;]|ORCPT|for|domain of|ESMTPSA id)?
\s*(?:"?([\w ]*?)[ "<])?
\s*<?([\w.]*?@[\w.]*)>?

The first line eliminates the prefixes, and is therefore non-capturing. It discards anything that ends with :, ; or =, as well as a few specific literals.

The second and third lines are the two capturing groups for name and email respectively.

It correctly parses the examples you provided.
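For anyone who wants to try the pattern outside a regex tester, here is a minimal Python sketch that joins the three lines into one expression. The function name `parse_contact` is hypothetical, and returning an empty string when the name group does not take part in the match is my own choice for the example.

```python
import re

# The three lines above, concatenated into a single pattern.
PATTERN = re.compile(
    r'\s*(?:.*?[:=;]|ORCPT|for|domain of|ESMTPSA id)?'  # prefix, discarded
    r'\s*(?:"?([\w ]*?)[ "<])?'                          # group 1: name
    r'\s*<?([\w.]*?@[\w.]*)>?',                          # group 2: email
    re.IGNORECASE,
)

def parse_contact(line):
    """Return (name, email) for the first match, or None if there is none."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Group 1 is None when the optional name part was skipped entirely.
    return (match.group(1) or "", match.group(2))

if __name__ == "__main__":
    for sample in ['"John Smith" <johnsmith@gmail.com>',
                   "Original-recipient: rfc822;john.smith@gmail.com",
                   "(ORCPT john.smith@gmail.com)"]:
        print(parse_contact(sample))
```

Note that `[\w.]` is narrower than what real addresses allow, so this sketch inherits that limitation from the pattern.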

See regex fiddle.

Solution with Perl

You can run this Perl one-liner (replace ... with the regex above, written on one line):

perl -ne 'while(/.../gi){print "$1|$2\n";}' yourinputfile

This will output the captured groups 1 and 2, separated by a pipe character:

John Smith|john.smith@gmail.com
John Smith|johnsmith@gmail.com
John Smith|johnsmith@gmail.com
John|johnsmith@gmail.com
John Smith|johnsmith@gmail.com
|johnsmith@gmail.com
|johnsmith@gmail.com
|johnsmith@gmail.com
John|johnsmith@gmail.com
John Smith|john.smith@gmail.com
John Smith|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
|john.smith@gmail.com
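
Since the stated goal is to sort addresses by how often they occur, the pipe-separated output above could be tallied along these lines. This is only a sketch: the function name `most_common_contacts` is hypothetical, and lower-casing addresses before counting is an assumption on my part.

```python
from collections import Counter

def most_common_contacts(pipe_lines):
    """Count addresses in 'name|email' lines, most frequent first."""
    counts = Counter()
    for line in pipe_lines:
        # Split on the last pipe so a pipe inside a name cannot break the email.
        _name, sep, email = line.rstrip("\n").rpartition("|")
        if sep and email:
            # Assumption: treat addresses as case-insensitive when counting.
            counts[email.lower()] += 1
    return counts.most_common()

if __name__ == "__main__":
    sample = [
        "John Smith|john.smith@gmail.com",
        "|johnsmith@gmail.com",
        "|john.smith@gmail.com",
    ]
    for email, count in most_common_contacts(sample):
        print(count, email)
```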
trincot
  • Excellent job, trincot. With the non-capturing group it's perfect - but there's a trailing space in the name `"John Smith "`. How can I correct this, while still respecting the space between first name and last name? I've played with `([\w ]*)[ "<])?\s*` to no avail. Many thanks. – Kier Nov 22 '15 at 01:54
  • Fixed that by making the "name" capturing group non-greedy (changed `[\w ]*` to `[\w ]*?`). I also made the prefix-matching non-capturing, and updated the regex fiddle accordingly. – trincot Nov 22 '15 at 08:57
  • You have my eternal thanks trincot! Plus, answer acceptance. :) Lastly, the regex seems to be incompatible with `grep`, I've tried both double and single quotes. I guess the regex is too advanced for `grep` - could you recommend another cli tool, preferably already bundled in vanilla linux? `$ grep -E o -h '/\s*(?:.*?[:=;]|ORCPT|for|domain of|ESMTPSA id)?\s*(?:"?([\w ]*?)[ "<])?\s*([\w.]*?@[\w.]*)>?/gi' examples` – Kier Nov 22 '15 at 10:38
  • I'm assuming the issue is unescaped characters? Must I escape them all? – Kier Nov 22 '15 at 10:50
  • It seems to make more sense to use an alternative to `grep` instead. I've read about `grep -P` but it doesn't seem to work for me either on Linux Mint. – Kier Nov 22 '15 at 10:56
  • In your `grep` statement, I notice a missing hyphen before the `o` option. I suppose this is not the only issue, but I thought I better mention it. I will look further. – trincot Nov 22 '15 at 11:12
  • Found a solution: perl. Added to my answer. – trincot Nov 22 '15 at 12:08