2

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.

    page = urllib.request.urlopen("http://website/category")
    reg_ex = re.compile(r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
    m = reg_ex.search_all(page)
    m.group()

When I ran it, the Python module said that there is an invalid syntax and it is on the line:

    m = reg_ex.search_all(page)

Would anyone tell me why it is invalid?

Sameer Singh
Kyungho Park

5 Answers

6

Consider an alternative:

    import re

    # Suppose we have a text with many email addresses
    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

    # Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    # ['alice@google.com', 'bob@abc.com']
    for email in emails:
        # do something with each found email string
        print(email)

Source: https://developers.google.com/edu/python/regular-expressions

TommyOKe
  • This might be the solution the OP is looking for, but it does not answer his question... – honk Jul 21 '14 at 17:25
  • So if the OP asks a question where he is trying to get a certain output and asks why his code doesn't work, I am only supposed to tell him why his code doesn't work and not give him a better solution? – TommyOKe Jul 21 '14 at 17:38
  • No, do both. Explain why his didn't work then provide a solution and explain why it does work. – takendarkk Jul 21 '14 at 18:01
  • It was explained 4 times why his doesn't work, so I didn't want to be redundant. – TommyOKe Jul 21 '14 at 18:20
  • This regex can also match invalid addresses like name@example that lack a TLD extension. – Anass Feb 22 '21 at 16:43
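
To illustrate the point raised in the last comment, here is a minimal sketch comparing the pattern from this answer with a slightly stricter variant. The stricter pattern and the sample text are assumptions made up for this example, not something from the answer or from the linked Google material, and neither pattern is a full email validator.

```python
import re

# Pattern from the answer: accepts anything around the "@", dotted or not.
loose = re.compile(r'[\w\.-]+@[\w\.-]+')

# Hypothetical stricter variant: requires at least one ".label" after the host.
stricter = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+')

text = 'write to name@example or to alice@google.com'

loose_hits = loose.findall(text)
stricter_hits = stricter.findall(text)

print(loose_hits)     # ['name@example', 'alice@google.com']
print(stricter_hits)  # ['alice@google.com']
```

The stricter variant uses a non-capturing group, `(?:...)`, so that `findall` still returns whole matches rather than group contents.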
2

You are missing the closing ) on this line. With it added:

    reg_ex = re.compile(r'[a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
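
As for why the error was reported on the search_all line rather than on the re.compile line: while a parenthesis is still open, Python keeps reading the following line as part of the same expression, so older versions flag the first token that no longer fits. Newer versions may point at the unclosed parenthesis itself. A small sketch that reproduces this with the built-in `compile()`; the helper name `syntax_error_of` and the sample address are made up for the example:

```python
# The question's code as a string, with the ")" missing on the second line.
broken = (
    "import re\n"
    "reg_ex = re.compile(r'[-a-z0-9._]+@([-a-z0-9]+)(\\.[-a-z0-9]+)+', re.IGNORECASE\n"
    "m = reg_ex.findall('someone@example.com')\n"
)
# Same source with the closing parenthesis restored.
fixed = broken.replace("re.IGNORECASE\n", "re.IGNORECASE)\n")


def syntax_error_of(source):
    """Return the SyntaxError raised when compiling source, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc
    return None


broken_error = syntax_error_of(broken)
fixed_error = syntax_error_of(fixed)

# The reported line number and message depend on the Python version.
print(type(broken_error).__name__, "reported at line", broken_error.lineno)
print(fixed_error)  # None: the corrected source compiles
```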

Plus, your regex is not valid; try this instead:

    "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

FYI, validating email using regex is not that trivial, see these threads:

Community
alecxe
  • Your suggested regex makes no sense in this use case. The OP wants to find an email address in a bunch of text, so the anchors are wrong here. – stema May 13 '14 at 08:45
  • @stema ok, it was just an example, but correct, no need to put boundaries. – alecxe May 13 '14 at 13:31
2

Besides, reg_ex has no search_all method. And you should pass in page.read().
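
A minimal sketch of both points, assuming Python 3 (where `page.read()` returns bytes that must be decoded before a str pattern can search them). The function name `find_emails`, the UTF-8 default, and the sample bytes are assumptions for illustration. The groups from the question's pattern are made non-capturing here, because with capturing groups `findall` returns tuples of the groups instead of the whole address.

```python
import re

# The question's pattern, with the groups changed to non-capturing (?:...)
# so that findall() returns complete addresses.
EMAIL_RE = re.compile(r'[-a-z0-9._]+@[-a-z0-9]+(?:\.[-a-z0-9]+)+', re.IGNORECASE)


def find_emails(raw_bytes, encoding="utf-8"):
    """Decode the raw page bytes and return every address-like string found."""
    text = raw_bytes.decode(encoding, errors="replace")
    return EMAIL_RE.findall(text)


# Stand-in for page.read(); a real page would come from urllib.request.urlopen().
sample = b"<p>Contact alice@example.com or bob@example.org</p>"
found = find_emails(sample)
print(found)  # ['alice@example.com', 'bob@example.org']
```

With a real response object, the call would be `find_emails(page.read())`; the page's actual encoding may differ from UTF-8.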

zhangyangyu
1

There is no .search_all method in the re module.

Maybe the one you are looking for is .findall.

You can try:

    re.findall(r"(\w(?:[-.+]?\w+)+\@(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)

I assume text is the text to search; in your case it should be text = page.read().

Or, if you want to compile the regex first:

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    results = r.findall(text)

Note: .findall returns a list of matches.

If you need to iterate over match objects instead, you can use .finditer

(from the example before):

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    for email_match in r.finditer(text):
        email_addr = email_match.group()  # or anything else you need from the match object

Now the problem is which regex you have to use :)

Kadmillos
0

Change r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+' to r'[aA-zZ0-9._]+@([aA-zZ0-9]+)(\.[aA-zZ0-9]+)+'. The - character before a-z is the cause
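
For anyone weighing this suggestion, a short sketch of how the two character classes behave in Python's `re` module. A `-` placed first inside `[...]` is a literal hyphen, and a range written as `A-z` also covers the punctuation characters that sit between `Z` and `a` in ASCII, so the two patterns are not equivalent. The variable names and test strings are made up for the example.

```python
import re

# Local-part class from the question: the leading "-" is a literal hyphen.
leading_hyphen = re.compile(r'[-a-z0-9._]+')

# Local-part class suggested above: "a", the range "A-z", "Z", digits, "." and "_".
wide_range = re.compile(r'[aA-zZ0-9._]+')

hyphen_ok = leading_hyphen.fullmatch('first-last') is not None
hyphen_lost = wide_range.fullmatch('first-last') is None
punctuation_ok = wide_range.fullmatch('a^b[c]') is not None

print(hyphen_ok)       # True: the hyphen is accepted
print(hyphen_lost)     # True: the suggested class rejects the hyphen
print(punctuation_ok)  # True: "^", "[" and "]" fall inside the A-z range
```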

Prahalad Deshpande