
I want a Python script that opens a link and prints the email addresses found on that page.

E.g.:

  1. Go to some site like example.com.
  2. Search for email addresses on that page.
  3. Search all the pages linked from that site.

I tried the code below:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data)

for rate in soup.find_all('@'):
    print rate.text

I took this website just for reference.

Can anyone help me with this?

Ganeshgm7
    Have you tried that? You can use [beautifulsoup](http://www.crummy.com/software/BeautifulSoup/) and [requests](http://www.python-requests.org/en/latest/) to do that. – Remi Guan Sep 24 '15 at 06:45
  • Yes, I tried with BeautifulSoup, but I can't get it to work. – Ganeshgm7 Sep 24 '15 at 06:50
  • What is your code? What is the error message? What is the output? – Remi Guan Sep 24 '15 at 06:51
  • import requests from bs4 import BeautifulSoup r = requests.get('http://www.digitalseo.in/') data = r.text soup = BeautifulSoup(data) for rate in soup.find_all('@'): print rate.text I didn't get any output. I took that website just for reference. – Ganeshgm7 Sep 24 '15 at 06:55
  • Okay, that's because the `find_all()` function searches for **tags**, not email addresses. I'll post an answer to explain this. And I think you should edit your question and add your code. – Remi Guan Sep 24 '15 at 06:58

1 Answer


That's because find_all() only searches for tags. From the documentation:

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
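For example, a quick check (not from the original post, using the same `soup` object as the question) shows why the original call returns nothing:

soup.find_all('@')   # [] -- there is no <@> tag, so nothing matches
soup.find_all('a')   # finds all <a> tags, because 'a' is a tag name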

So you need to add a keyword argument like this:


import re
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data, "html.parser")

# match every tag whose href attribute contains "mailto"
for i in soup.find_all(href=re.compile("mailto")):
    print i.string

Demo:

contact@digitalseo.in
contact@digitalseo.in
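Note that this prints the link text. If the visible text is something like "Contact us", you can pull the address out of the href attribute instead (a small variation on the code above, same `soup` object):

for i in soup.find_all(href=re.compile("mailto")):
    # strip the "mailto:" prefix to keep only the address
    print i.get('href').replace('mailto:', '')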


From the documentation:

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

See the documentation for more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all


And if you'd like to find email addresses in a document, a regex is a good choice.

For example:

import re
re.findall(r'[^@]+@[^@]+\.[^@]+', text)  # replace `text` with your document's text
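For example, here is a rough sketch that scans the raw HTML of the same example page for anything that looks like an email address. The pattern is a tighter variant of the one above (an assumption on my part) so that it doesn't swallow the surrounding markup, and it may still produce false positives:

import re
import requests

r = requests.get('http://www.digitalseo.in/')
# loose "something@domain.tld" pattern; deduplicate with a set
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', r.text))
for email in emails:
    print email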

And if you'd like to find links in a page by keyword, just use .get('href') like this:

import re
import requests
from bs4 import BeautifulSoup

def get_link_by_keyword(keyword):
    # collect every href that starts with "http" or "/" and contains the keyword
    links = set()
    for i in soup.find_all(href=re.compile(r"(http|/).*" + str(keyword))):
        links.add(i.get('href'))

    for i in links:
        if i[0] == 'h':
            # already an absolute link
            yield i
        elif i[0] == '/':
            # relative link: prepend the base URL
            yield link + i

link = raw_input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]

r = requests.get(link, verify=True)
data = r.text
soup = BeautifulSoup(data, "html.parser")

for i in get_link_by_keyword(raw_input('Enter a keyword: ')):
    print i
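Putting the pieces together for the original goal (open a site, follow its pages, and collect email addresses), here is a rough, self-contained sketch. The one-level crawl, the email pattern, and the `emails_in_page` helper are all my own assumptions, not part of the answer above:

import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def emails_in_page(url):
    # download one page and return every string that looks like an email address
    try:
        html = requests.get(url, verify=True).text
    except requests.RequestException:
        return set()
    return set(EMAIL_RE.findall(html))

link = raw_input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]

r = requests.get(link, verify=True)
soup = BeautifulSoup(r.text, "html.parser")

# collect the site's own pages, one level deep only
pages = set([link])
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.startswith('/'):
        pages.add(link + href)
    elif href.startswith(link):
        pages.add(href)

found = set()
for page in pages:
    found |= emails_in_page(page)

for email in found:
    print email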
Remi Guan