
I'm trying to develop two regular expressions, one in JavaScript and the other in PHP, that capture the email address(es) found in a raw email message that belong to one particular field (e.g. To:) and only that field (i.e. no other addresses from anywhere else in the corpus), but I've had no success.

Here are the ideal requirements:

  • Must begin on a new line and from the beginning of that line.

  • The line must begin with "To", for example (double quotes excluded; case-insensitive; a single colon is optional; one or more spaces after it are optional).

  • Must capture all email addresses thereafter, individually, up to the last email address but before any non-email-address word (examples of such words, though not limited to these: Subject:, From:, CC:, Hello, etc.)

I've had success with requirements #1 and #2 but have struggled with #3. I've been forced to simply solve for #1 and #2 and simply split/explode the results based on the commas, which is fine, but I know better can be had.

Here is a sample email from the public Enron email dataset:

Message-ID: <3470405.1075840065684.JavaMail.evans@thyme>
Date: Sun, 14 Feb 1999 01:33:00 -0800 (PST)
From: markskilling@hotmail.com
To: majalinda@hotmail.com, ksbiehl@hotmail.com, dlmackler@worldnet.att.net, 
    cjones@cityofnapa.org, hazerfen@hotmail.com, meyerjames@usa.net, 
    tomskilljr@aol.com, c.combs@intershop.com, mshachat@aol.com, 
    clowes@email.msn.com, clowes@cmithlaw.com, transwd@aol.com, 
    smackarnes@aol.com, samjstokes@aol.com, joguti@aol.com, 
    bjmackaysmith@hotmail.com, m_larnold@sprynet.com, dwood@rwblaw.com, 
    daveroche@aol.com, milobenn@sirius.com, pwc1@aol.com, 
    candc@ix.netcom.com, eisenbachrl@cooley.com, mwf15@columbia.edu, 
    khuber@hcmwealth.com, doyna@coffeenet.com, katekross@aol.com, 
    mark.langermann@issna.com, martin@sbu.edu, deniz.razon@abbott.com, 
    sras@lycosmail.com, jeff.skilling@enron.com, tskilling@tribune.com, 
    audryn@mindspring.com, mmmmisha@ix.netcom.com, ermak@gte.net
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: "Mark Skilling" <markskilling@hotmail.com>
X-To: majalinda@hotmail.com, ksbiehl@hotmail.com, dlmackler@worldnet.att.net, cjones@cityofnapa.org, hazerfen@hotmail.com, meyerjames@usa.net, tomskilljr@aol.com, c.combs@intershop.com, mshachat@aol.com, clowes@email.msn.com, clowes@cmithlaw.com, transwd@aol.com, smackarnes@aol.com, samjstokes@aol.com, joguti@aol.com, bjmackaysmith@hotmail.com, m_larnold@sprynet.com, dwood@rwblaw.com, daveroche@aol.com, milobenn@sirius.com, pwc1@aol.com, candc@ix.netcom.com, eisenbachrl@cooley.com, mwf15@columbia.edu, khuber@hcmwealth.com, doyna@coffeenet.com, katekross@aol.com, mark.langermann@issna.com, martin@sbu.edu, deniz.razon@abbott.com, sras@lycosmail.com, Jeff Skilling, tskilling@tribune.com, audryn@mindspring.com, mmmmisha@ix.netcom.com, ermak@GTE.net
X-cc: 
X-bcc: 
X-Folder: \Jeffrey_Skilling_Dec2000\Notes Folders\All documents
X-Origin: SKILLING-J
X-FileName: jskillin.nsf

February 10, 1999

I am wakened by the approaching chatter of the early morning call to
prayer (sounding a bit like the fuss made by one of those cartoon balls
of fighting dogs and cats).  From the minarets of far away mosques, the
muezzins' cries ricochet through Istanbul's still dark alleys and
streets.  Seagulls, who have drifted up the hill from the Golden Horn,
squawk contentedly outside my window.  From somewhere down below, a
miserable dog joins into the pre-dawn ruckus, soon followed by the local
muezzin, whose amplified singing drowns out all the rest.  He reminds us
that God is great and that prayer is a whole lot more important than
sleep (at least that's what I've been told; he sings in Arabic).
Because my religion thinks more highly of sleep, I feel free to simply
listen, while gently trying to pull the warm blanket of sleep back over
me.  The  muezzin has a beautiful voice.  Its rise and fall stitches
itself into the edges of a dream (in which a former best friend and I
argue about the rules of a game of miniature golf) hanging just out of
reach.

Slowly, the banal calculations that fill my days begin to crowd their
way into my head.  It's about a quarter to six, I figure, which means
there's time for a bit of writing, or even Turkish vocabulary, before I
douse myself in the shower to full consciousness.  I remind myself of
the theory that one can write most freely while still intoxicated with
sleep (or just plain intoxicated), am immediately stricken with the fear
I am incapable of such freedom, take a look round my brain for something
worth writing about (find nothing), hypothesize about the advantages of
a quick dash into the hallway to turn on the gas heater (so that when I
really get up it will be reasonably warm out there), wonder if I really
do have enough stuff prepared to fill up the two hours of my English
lesson with Suleyman, conclude that all this thinking has probably made
any more sleep impossible, then (I realize later) fall back to sleep.

                               *     *     *

My new phone [(212) 292-6486] is hooked up and I have a new internet
server, which will make it much easier to keep in touch.  Hope to attack
that backlog of responses that are due.

Keep in touch.

Mark-O

______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com

Thanks for your help. Hope this can benefit anyone else searching! :)

Here is the regex I'm currently using; it satisfies requirements #1 and #2 and returns the blob of recipients for the particular field:

/^(?:To:?(?:\s+)?)((?:(?:(?:(?:[^<>()[\]\\.,;:\s@\"]+(?:\.[^<>()[\]\\.,;:\s@\"]+)*)|(?:\".+\"))@(?:(?:\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(?:(?:[a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))),?\s+)+)+/mi;
user3871070
  • This is from a dataset that exists in the public domain; they're as good as examples. https://www.cs.cmu.edu/~./enron/ – user3871070 Aug 31 '14 at 04:13
  • It's almost like using a brick to drive a nail into a wall while **KNOWING** there's better out there. We agree that the solution suffices, but I **KNOW** there's better out there and this venue will help me find it. I'm in the middle of Home Depot; someone's bound to find my hammer. Besides, SO is a refining platform. – user3871070 Aug 31 '14 at 05:19
  • Why do you think regex is better than simply matching up to a point and then splitting instead of trying to use a recursive regex to capture each individual email address?? – hwnd Aug 31 '14 at 05:34
  • My goal was to retrieve the desired result in one action. – user3871070 Aug 31 '14 at 06:12
  • Where are your attempts? I see what your 'ideal requirements' are, although I don't see any code you've tried. – l'L'l Aug 31 '14 at 06:15
  • My apologies. Updated. :) – user3871070 Aug 31 '14 at 06:38

1 Answer


After you've solved for #1 and #2, you could use this to grab each of the email addresses from your selection:

[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})

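This should grab most ordinary email addresses, provided the case-insensitive flag is set (the pattern is lowercase-only and limits the final domain label to 2-4 letters), and won't grab invalid ones like: example@gmail...com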

link: Using a regular expression to validate an email address
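
As a rough sketch of that two-step idea in JavaScript: first capture the field's value, including folded continuation lines that begin with whitespace, then run the address pattern above over just that value with the `g` and `i` flags. The function name `extractFieldAddresses` and the shortened sample message are made up for illustration. The sketch also treats the colon as mandatory, which is stricter than requirement #2, so that a line such as "Tomorrow..." is not mistaken for the field.

```javascript
// Hypothetical helper (not from the original post): returns the addresses
// listed in one header field of a raw message.
// Assumes fieldName contains no regex metacharacters (e.g. "To", "Cc", "From").
function extractFieldAddresses(rawMessage, fieldName = "To") {
  // Step 1: the field name must start a line. Its value runs to the end of
  // that line plus any folded continuation lines (lines starting with a
  // space or tab). The colon is required here, by assumption.
  const fieldPattern = new RegExp(
    "^" + fieldName + ":[ \\t]*(.*(?:\\r?\\n[ \\t]+.*)*)",
    "im"
  );
  const field = fieldPattern.exec(rawMessage);
  if (field === null) {
    return [];
  }

  // Step 2: pull each address out of the captured value only, using the
  // pattern from this answer with the global and case-insensitive flags.
  const addressPattern =
    /[_a-z0-9-]+(?:\.[_a-z0-9-]+)*@[a-z0-9-]+(?:\.[a-z0-9-]+)*(?:\.[a-z]{2,4})/gi;
  return field[1].match(addressPattern) || [];
}

// Shortened version of the sample message, for illustration only.
const sample = [
  "From: markskilling@hotmail.com",
  "To: majalinda@hotmail.com, ksbiehl@hotmail.com, ",
  "    cjones@cityofnapa.org",
  "Subject: ",
  "X-To: majalinda@hotmail.com, Jeff Skilling",
  "",
  "Keep in touch at someone.else@example.com",
].join("\n");

console.log(extractFieldAddresses(sample));
// expected: ["majalinda@hotmail.com", "ksbiehl@hotmail.com", "cjones@cityofnapa.org"]
```

Because the second pattern only ever sees the captured field value, addresses in `From:`, `X-To:` or the body cannot leak into the result, and there is no need to split on commas.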

Community
Zack
  • Possibly but I'd rather do it **while** I'm solving for #1 and #2. and with this email validating expression: (([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,})) – user3871070 Aug 31 '14 at 05:14
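
For the single-pass variant asked for in that comment, one possible sketch in JavaScript uses a lookbehind, which needs an ES2018-capable engine. The lookbehind asserts that everything between a line-initial `To:` and the current position is a run of comma-terminated entries, optionally folded onto indented continuation lines, so each match is one address from that field. The names `toPrefix`, `addressPart` and `toAddresses` are made up for illustration, the colon is treated as mandatory (stricter than requirement #2), and the simpler address pattern from the answer is used rather than the longer one in the comment. As far as I know, PCRE in PHP has generally required fixed-length lookbehinds, so this exact pattern is not expected to carry over there; the `\G` anchor is the usual route in PCRE.

```javascript
// Lookbehind: a line starting with "To:", then zero or more entries that each
// end in a comma, optionally followed by a fold (newline plus indentation).
const toPrefix = /(?<=^To:[ \t]*(?:[^,\r\n]*,[ \t]*(?:\r?\n[ \t]+)?)*)/;

// Address pattern taken from the answer above.
const addressPart =
  /[_a-z0-9-]+(?:\.[_a-z0-9-]+)*@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,4}/;

// Combine the two sources; g = all matches, i = ignore case, m = ^ per line.
const toAddresses = new RegExp(toPrefix.source + addressPart.source, "gim");

// Shortened version of the sample message, for illustration only.
const message = [
  "From: markskilling@hotmail.com",
  "To: majalinda@hotmail.com, ksbiehl@hotmail.com, ",
  "    cjones@cityofnapa.org",
  "Subject: ",
  "X-To: majalinda@hotmail.com, Jeff Skilling",
  "",
  "Keep in touch at someone.else@example.com",
].join("\n");

console.log(message.match(toAddresses));
// expected: ["majalinda@hotmail.com", "ksbiehl@hotmail.com", "cjones@cityofnapa.org"]
```

The chain of entries breaks at the first line that does not end in a comma followed by an indented continuation, which is why `Subject:`, `X-To:` and the body are never reached.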