This is the regular expression I have formed so far:
/(?:("?(?:.*)"?)\s*)?\s<(.*@.*)>|(?:mailto:(.*@.*))|(.*@.*)/gi
You can check it out at regex101
I'm trying to extract 'Name' & 'Email' from the following:
John Smith <john.smith@gmail.com>
John Smith <johnsmith@gmail.com>
"John Smith" <johnsmith@gmail.com>
"John" <johnsmith@gmail.com>
John Smith<johnsmith@gmail.com>
<johnsmith@gmail.com>
johnsmith@gmail.com
mailto:johnsmith@gmail.com
"John"<johnsmith@gmail.com>
To: John Smith <john.smith@gmail.com>
From: John Smith <john.smith@gmail.com>
Reply-to: john.smith@gmail.com
Return-path: <john.smith@gmail.com>
Message-id: <john.smith@gmail.com>
References: <john.smith@gmail.com>
Original-recipient: rfc822;john.smith@gmail.com
for john.smith@gmail.com
ESMTPSA id <john.smith@gmail.com>
domain of john.smith@gmail.com
envelope-from=john.smith@gmail.com
(ORCPT john.smith@gmail.com)
Having started from scratch, I feel as if I'm almost there, but I'm having trouble with three things:
Stripping double quotes from the first capturing group
Dealing with the whitespace missing variant:
John Smith<johnsmith@gmail.com>
False positives in the 'Name' field for the latter block, so I need a way of excluding these (perhaps using the preceding :, :, =, for, id, of?)
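A minimal sketch of one way these three issues might be approached, written in Python rather than grep or sed so that it is easy to test. Everything named here (`EMAIL`, `NAMED`, `STOP_WORDS`, `extract`) is hypothetical, and the address pattern is a deliberately loose assumption, not a full RFC 5322 matcher. The quotes sit outside the name group, the whitespace before `<` is optional, and a "name" that ends in for, id, or of is discarded.

```python
import re

# Assumption: an address is a run of characters around one '@', excluding
# whitespace, brackets, parentheses, quotes and the separators ':', ';', '='.
EMAIL = r'[^\s<>()"\':;=]+@[^\s<>()"\':;=]+'

NAMED = re.compile(
    r'^(?:[A-Za-z-]+:\s*)?'        # optional header label such as "To:" (dropped)
    r'"?(?P<name>[^"<>:=]*?)"?'    # name; the double quotes stay outside the group
    r'\s*<(?P<email>' + EMAIL + r')>'  # whitespace before '<' is optional
)
ANY_EMAIL = re.compile(EMAIL)

# Assumption: a "name" ending in one of these words is really header text.
STOP_WORDS = {"for", "id", "of"}


def extract(line):
    """Return (name or None, email) for a line, or None if no address is found."""
    match = NAMED.match(line.strip())
    if match:
        name = match.group("name").strip()
        words = name.split()
        if words and words[-1].lower() in STOP_WORDS:
            name = ""  # e.g. "ESMTPSA id <...>" carries no real name
        return (name or None, match.group("email"))
    match = ANY_EMAIL.search(line)
    return (None, match.group(0)) if match else None
```

With this sketch, `extract('"John"<johnsmith@gmail.com>')` yields the name without quotes, and lines such as `envelope-from=john.smith@gmail.com` yield an address with no name.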
As a complete regular expression novice, I would appreciate a little direction from someone knowledgeable on how I might overcome these issues.
For the curious, I've unfortunately lost my CardDAV data and thus all my contacts, so in true Linux fashion, I'm going to rebuild a list of emails by manually parsing my entire raw MBOX, sorting by most common, and going from there.
I will be using bash grep, or perl sed.
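For the "sorting by most common" step, a rough Python sketch (again a swap-in for grep or sed, with hypothetical names `EMAIL_RE` and `count_addresses`, and the same loose address pattern assumed) could look like this:

```python
import re
from collections import Counter

# Assumption: loose address shape for a quick first pass, not RFC 5322.
EMAIL_RE = re.compile(r'[^\s<>()"\':;=,]+@[^\s<>()"\':;=,]+')


def count_addresses(lines):
    """Count each address (lowercased) across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for address in EMAIL_RE.findall(line):
            counts[address.lower()] += 1
    return counts


# Usage with a hypothetical file path:
# with open("mail.mbox", errors="replace") as handle:
#     for address, seen in count_addresses(handle).most_common(20):
#         print(seen, address)
```

Note that Message-id and References values have the same shape as an address, so those header lines would probably need to be skipped before counting.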
Thank you for your time!