Finding email addresses in a web page using a regular expression

Question

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.

    page = urllib.request.urlopen("http://website/category")
    reg_ex = re.compile(r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
    m = reg_ex.search_all(page)
    m.group()

When I ran it, Python reported invalid syntax on this line:

    m = reg_ex.search_all(page)

Would anyone tell me why it is invalid?

TommyOKe · Answer 1 · 2014-07-21T17:33:22.777

6

Consider an alternative:

    import re

    # Suppose we have a text with many email addresses
    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

    # Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    # ['alice@google.com', 'bob@abc.com']
    for email in emails:
        # do something with each found email string
        print(email)

Source: https://developers.google.com/edu/python/regular-expressions

edited Jul 21 '14 at 17:33

answered Jul 21 '14 at 16:58

TommyOKe


This might be the solution the OP is looking for, but it does not answer his question... – honk Jul 21 '14 at 17:25
So if the OP asks a question where he is trying to get a certain output and asks why his code doesn't work, I am only supposed to tell him why his code doesn't work and not give him a better solution? – TommyOKe Jul 21 '14 at 17:38
No, do both. Explain why his didn't work then provide a solution and explain why it does work. – takendarkk Jul 21 '14 at 18:01
It was explained 4 times why his doesn't work, so I didn't want to be redundant. – TommyOKe Jul 21 '14 at 18:20
This regex can also match invalid addresses such as name@example, which has no TLD extension. – Anass Feb 22 '21 at 16:43
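To make the last comment concrete, here is a rough sketch comparing the pattern from this answer with a stricter variant. The stricter pattern and the sample text are assumptions made up for illustration; they are not from the linked source, and the stricter pattern is still far from a full email validator.

```python
import re

# Pattern from the answer: any run of word characters, dots and hyphens
# on both sides of the @.
loose = re.compile(r'[\w\.-]+@[\w\.-]+')

# Hypothetical stricter variant: the domain must contain at least one
# ".label" part and cannot end with a dot.
stricter = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+')

text = "Mail name@example or alice@google.com."

loose_hits = loose.findall(text)      # also picks up the sentence-ending dot
strict_hits = stricter.findall(text)  # skips name@example

print(loose_hits)   # ['name@example', 'alice@google.com.']
print(strict_hits)  # ['alice@google.com']
```

Note that the loose pattern also swallows the trailing full stop, which is easy to miss when scraping prose.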

score 2 · Answer 2 · edited May 23 '17 at 12:32


You are missing the closing ) on this line; it should read:

    reg_ex = re.compile(r'[a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
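This also explains why the error message named the search_all line rather than the re.compile line: inside an unclosed parenthesis Python keeps reading across line breaks, so the parser only gets stuck once it reaches the next statement. A minimal sketch that reproduces this, with the question's two lines held in a string and a shortened pattern (an assumption, to keep the example readable):

```python
# The two offending lines, kept as a string so the error can be caught
# instead of stopping this script. The pattern is shortened for the example.
source = (
    "reg_ex = re.compile(r'[a-z]+@[a-z]+', re.IGNORECASE\n"
    "m = reg_ex.search_all(page)\n"
)

error = None
try:
    # compile() only parses the source; nothing is executed or imported.
    compile(source, "<example>", "exec")
except SyntaxError as exc:
    error = exc
    print("SyntaxError:", exc.msg, "(reported on line", str(exc.lineno) + ")")
```

Which line gets reported depends on the interpreter version: older releases tend to point at the following line, while newer ones may point at the unclosed parenthesis itself.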

Also, your regex is stricter than it needs to be (for example, it rejects a + in the local part); try this instead:

    r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

FYI, validating email using regex is not that trivial, see these threads:

edited May 23 '17 at 12:32

Community


answered Aug 08 '13 at 07:17

alecxe


Your suggested regex makes no sense in this use case. The OP wants to find an email address in a bunch of text, so the anchors are wrong here. – stema May 13 '14 at 08:45
@stema OK, it was just an example, but you are correct: there is no need for boundaries. – alecxe May 13 '14 at 13:31

score 2 · Answer 3 · answered Aug 08 '13 at 07:19


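Besides, a compiled pattern such as reg_ex has no search_all method (use findall or finditer instead), and you should pass in the page contents from page.read(), not the response object itself.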
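Putting those two points together, a corrected version of the question's code might look like the sketch below. Some assumptions are mine: the helper names find_emails and fetch_text are made up, the page is assumed to be UTF-8, and the groups in the pattern are made non-capturing so that findall returns whole addresses rather than tuples of group contents. In Python 3, page.read() returns bytes, so it has to be decoded before a str pattern can be applied.

```python
import re
import urllib.request

# The question's pattern with (?:...) groups, so findall() returns the
# full match for each address instead of a tuple of group contents.
EMAIL_RE = re.compile(r'[-a-z0-9._]+@(?:[-a-z0-9]+)(?:\.[-a-z0-9]+)+',
                      re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that matches EMAIL_RE."""
    return EMAIL_RE.findall(text)


def fetch_text(url, encoding="utf-8"):
    """Download url and decode the response body (encoding is assumed)."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode(encoding, errors="replace")


# Offline demonstration; for a real page use something like
# find_emails(fetch_text("http://website/category")).
sample = "Contact alice@example.com or Bob.Smith@mail.example.org today."
emails = find_emails(sample)
print(emails)  # ['alice@example.com', 'Bob.Smith@mail.example.org']
```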

answered Aug 08 '13 at 07:19

zhangyangyu


score 1 · Answer 4 · answered Aug 08 '13 at 08:54

There is no .search_all method in the re module.

Maybe the one you are looking for is .findall.

You can try:

    re.findall(r"(\w(?:[-.+]?\w+)+\@(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)

I assume text is the text to search; in your case it should be text = page.read().

Or you can compile the regex first:

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    results = r.findall(text)

Note: .findall returns a list of matches.

If you need to iterate over match objects instead, you can use .finditer (same regex as in the previous example):

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    for email_match in r.finditer(text):
        email_addr = email_match.group()  # or anything else you need from the match object

Now the problem is which regex you have to use :)
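One detail worth knowing when choosing between the two: if the pattern contains more than one capturing group, findall returns tuples of group contents rather than the full match, whereas finditer always gives match objects. A small sketch with a simplified pattern and sample text (both are assumptions for illustration, not the regex above):

```python
import re

text = "Write to alice@example.com or bob@example.org."

# Two capturing groups: local part and domain.
grouped = re.compile(r'([\w.-]+)@([\w-]+(?:\.[\w-]+)+)')

# findall() returns one tuple of group contents per match.
pairs = grouped.findall(text)
print(pairs)  # [('alice', 'example.com'), ('bob', 'example.org')]

# finditer() yields match objects, which expose the whole match and its
# position in the text.
found = [(m.group(0), m.start()) for m in grouped.finditer(text)]
print(found)  # [('alice@example.com', 9), ('bob@example.org', 30)]
```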

score 0 · Answer 5 · answered Aug 08 '13 at 07:17


Change r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+' to r'[a-zA-Z0-9._]+@([a-zA-Z0-9]+)(\.[a-zA-Z0-9]+)+'. The - character before a-z is the cause.

answered Aug 08 '13 at 07:17

Prahalad Deshpande

