3

I am looking to match e-mail addresses in a text document for which I am writing a regex. I have come up with something like this for starters -

((?:[a-zA-Z]+[\w+\.\-]+[\-a-zA-Z]+))[ ]*((?:@|at))[ ]*(?:[a-zA-Z\.]+)

I want to make sure that the end of the e-mail address is a 'edu' or 'com'. How do I do this? I am using Python.

Some sample e-mail addresses from my text document

alice @ so.edu
alice at sm.so.edu
alice @ sm.com

Edit -

I want to make a change to this regex ONLY. My regex fits a few other examples in my data.

Dexter
  • 11,311
  • 11
  • 45
  • 61

2 Answers2

2
((?:[a-zA-Z]+[\w+\.\-]+[\-a-zA-Z]+))[ ]*((?:@|at))[ ]*(?:[a-zA-Z\.]+)\.(com|edu)

EDIT: For a "dot" instead of ".":

((?:[a-zA-Z]+[\w+\.\-]+[\-a-zA-Z]+))[ ]*((?:@|at))[ ]*(?:[a-zA-Z\.]+) *(\.|dot) *(com|edu)
wrongusername
  • 18,564
  • 40
  • 130
  • 214
  • Why is the '$' not needed? Because - (?:[a-zA-Z\.]+)\. gets through the whole domain/sub-domain thing right? Just want to confirm if I have understood it correctly. – Dexter Mar 24 '12 at 23:06
  • 2
    @mcenly Hm, I don't see a `$` in your regex or mine. A `$` would match only e-mail addresses at the end of a line. That may or may not be what you want. In your sample document with only one e-mail per line, I don't think it matters at all, but in a text document with e-mails sprawled throughout, a `$` would cause only those e-mails at the end of a line to be matched. – wrongusername Mar 24 '12 at 23:08
  • by end of line you mean \n right and not end of string? Sorry to bother you on this. – Dexter Mar 24 '12 at 23:10
  • @mcenley Yes, that's what I meant! – wrongusername Mar 24 '12 at 23:11
  • I also have e-mails of the pattern 'alice @ bob dot com' in my dataset. The mentioned regex won't work then right? – Dexter Mar 24 '12 at 23:14
1

First of all, see this answer for explanation of how to match all valid e-mail addresses as per RFC822.

I would personally not modify the regular expression, but use email.Utils.parseaddr() on the regexp matches instead and check that the resulting string .endswith("edu") or .endswith("com"). E.g.

>>> email.Utils.parseaddr("kimvais@mailinator.com")[1].endswith(".com")
True
Community
  • 1
  • 1
Kimvais
  • 38,306
  • 16
  • 108
  • 142