That happens because the first positional argument of find_all() matches tag names, not attribute values. From the documentation:
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all()
method looks through a tag’s descendants and retrieves all descendants that match your filters.
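In other words, a bare string passed as the first argument is looked up as a tag name, which is why filtering on an href value needs a keyword argument. A minimal sketch (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="mailto:contact@example.com">Mail us</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("mailto"))  # [] -- there is no <mailto> tag
# The same string as a keyword argument matches the href attribute:
print(soup.find_all(href="mailto:contact@example.com"))
```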
So you need to add a keyword argument like this:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all(href=re.compile("mailto")):
    print(i.string)
Demo:
contact@digitalseo.in
contact@digitalseo.in
Again from the documentation:
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
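Both quoted examples run as-is against a minimal document (the two-link HTML snippet below is invented to match the docs):

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(id="link2"))                # filter on the id attribute
print(soup.find_all(href=re.compile("elsie")))  # regex filter on href
```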
See the documentation for more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
And if you'd like to find email addresses in a document, a regex is a good choice. For example:
import re
re.findall(r'[^@]+@[^@]+\.[^@]+', text)  # replace `text` with your own string
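A quick sketch of that in action (the sample text is invented, and the character classes are tightened with `\s` so whitespace can't leak into a match when the text contains several addresses):

```python
import re

text = "Contact contact@digitalseo.in or admin@example.org for details."
# Excluding whitespace keeps each match to a single address.
emails = re.findall(r'[^@\s]+@[^@\s]+\.[^@\s]+', text)
print(emails)  # ['contact@digitalseo.in', 'admin@example.org']
```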
And if you'd like to find links in a page by keyword, just use .get('href') like this:
import re
import requests
from bs4 import BeautifulSoup
def get_link_by_keyword(keyword):
    links = set()
    # (http|/) matches absolute or root-relative hrefs; the original
    # character class [http|/] only matched single characters.
    for i in soup.find_all(href=re.compile(r"(http|/).*" + re.escape(str(keyword)))):
        links.add(i.get('href'))
    for i in links:
        if i[0] == 'h':       # absolute link
            yield i
        elif i[0] == '/':     # relative link: prepend the base URL
            yield link + i

link = input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]
r = requests.get(link, verify=True)
data = r.text
soup = BeautifulSoup(data, "html.parser")
for i in get_link_by_keyword(input('Enter a keyword: ')):
    print(i)
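As a side note, the manual 'h' / '/' prefix check can be replaced by the standard library's urllib.parse.urljoin, which resolves both absolute and relative hrefs against a base URL. A small sketch (the URLs are examples):

```python
from urllib.parse import urljoin

base = "http://www.digitalseo.in"  # example base URL
for href in ["http://www.digitalseo.in/contact", "/about"]:
    # Absolute hrefs pass through unchanged; relative ones are resolved.
    print(urljoin(base, href))
```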