
I want a Python script that opens a link and prints the email addresses found on that page.

E.g.:

  1. Go to some site like example.com.
  2. Search for email addresses on that page.
  3. Search all the pages linked from that site.

I tried the code below:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data)

for rate in soup.find_all('@'):
    print rate.text

I took this website just for reference.

Can anyone help me with this?

Ganeshgm7
    Have you tried that? You can use [beautifulsoup](http://www.crummy.com/software/BeautifulSoup/) and [requests](http://www.python-requests.org/en/latest/) to do that. – Remi Guan Sep 24 '15 at 06:45
  • Yes, I tried with BeautifulSoup, but I can't get it to work. – Ganeshgm7 Sep 24 '15 at 06:50
  • What is your code? What is the error message? What is the output? – Remi Guan Sep 24 '15 at 06:51
  • import requests from bs4 import BeautifulSoup r = requests.get('http://www.digitalseo.in/') data = r.text soup = BeautifulSoup(data) for rate in soup.find_all('@'): print rate.text I didn't get any output. I took that website just for reference. – Ganeshgm7 Sep 24 '15 at 06:55
  • Okay, that's because the `find_all()` function searches for **tags**, not email addresses. I'll post an answer to explain this. And I think you should edit your question and add your code. – Remi Guan Sep 24 '15 at 06:58

1 Answer


That's because find_all() only searches for tags. From the documentation:

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
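For example, a quick check (not from the original post, using the same `soup` object as the question) shows why the original call returns nothing:

soup.find_all('@')   # [] -- there is no <@> tag, so nothing matches
soup.find_all('a')   # finds all <a> tags, because 'a' is a tag name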

So you need to add a keyword argument like this:


import re
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data, "html.parser")

# match every tag whose href attribute contains "mailto"
for i in soup.find_all(href=re.compile("mailto")):
    print i.string

Demo:

contact@digitalseo.in
contact@digitalseo.in
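Note that this prints the link text. If the visible text is something like "Contact us", you can pull the address out of the href attribute instead (a small variation on the code above, same `soup` object):

for i in soup.find_all(href=re.compile("mailto")):
    # strip the "mailto:" prefix to keep only the address
    print i.get('href').replace('mailto:', '')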


From the documentation:

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

See the documentation for more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all


And if you'd like to find email addresses in a document, a regex is a good choice.

For example:

import re
re.findall(r'[^@]+@[^@]+\.[^@]+', text)  # replace `text` with your document's text
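For example, here is a rough sketch that scans the raw HTML of the same example page for anything that looks like an email address. The pattern is a tighter variant of the one above (an assumption on my part) so that it doesn't swallow the surrounding markup, and it may still produce false positives:

import re
import requests

r = requests.get('http://www.digitalseo.in/')
# loose "something@domain.tld" pattern; deduplicate with a set
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', r.text))
for email in emails:
    print email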

And if you'd like to find links in a page by keyword, just use .get('href') like this:

import re
import requests
from bs4 import BeautifulSoup

def get_link_by_keyword(keyword):
    # collect every href that starts with "http" or "/" and contains the keyword
    links = set()
    for i in soup.find_all(href=re.compile(r"(http|/).*" + str(keyword))):
        links.add(i.get('href'))

    for i in links:
        if i[0] == 'h':
            # already an absolute link
            yield i
        elif i[0] == '/':
            # relative link: prepend the base URL
            yield link + i

link = raw_input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]

r = requests.get(link, verify=True)
data = r.text
soup = BeautifulSoup(data, "html.parser")

for i in get_link_by_keyword(raw_input('Enter a keyword: ')):
    print i
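Putting the pieces together for the original goal (open a site, follow its pages, and collect email addresses), here is a rough, self-contained sketch. The one-level crawl, the email pattern, and the `emails_in_page` helper are all my own assumptions, not part of the answer above:

import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def emails_in_page(url):
    # download one page and return every string that looks like an email address
    try:
        html = requests.get(url, verify=True).text
    except requests.RequestException:
        return set()
    return set(EMAIL_RE.findall(html))

link = raw_input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]

r = requests.get(link, verify=True)
soup = BeautifulSoup(r.text, "html.parser")

# collect the site's own pages, one level deep only
pages = set([link])
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.startswith('/'):
        pages.add(link + href)
    elif href.startswith(link):
        pages.add(href)

found = set()
for page in pages:
    found |= emails_in_page(page)

for email in found:
    print email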
Remi Guan