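The e-mail addresses on that page are obfuscated to prevent naive scraping. The /cdn-cgi/l/email-protection path and the __cf_email__ class are the markers of Cloudflare's e-mail address obfuscation. Here is one such encoded e-mail address: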
<p>
<a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
<i class="fa fa-envelope-o"></i>
<span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email protected]</span>
</a>
</p>
The address is then decoded in the browser by a JavaScript script.
So, your options are:
- Reverse-engineer the decoding script
- Use some kind of JavaScript runtime to execute the decoding script
- If you're going to use a JavaScript runtime, you might as well use Selenium to begin with (there seems to be a scrapy-selenium middleware that you could use if you want to stick with Scrapy)
EDIT - I've reverse-engineered it for fun:
def deobfuscate(string, start_index):
    # The encoded text is a run of hex byte pairs. The first byte is the
    # XOR key; every following byte is one character XORed with that key.

    def extract_hex(string, index):
        substring = string[index: index+2]
        return int(substring, 16)

    key = extract_hex(string, start_index)

    for index in range(start_index+2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)


def process_tag(tag):
    # Decode the part of an <a> tag's href that follows the protection prefix.
    url_fragment = "/cdn-cgi/l/email-protection#"

    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None


def main():

    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"

    response = requests.get(url)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    # Decoded hrefs may still be percent-encoded (e.g. %20), hence unquote.
    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    # Other tags carry the same encoding in a "data-cfemail" attribute.
    cf_elem_attr = "data-cfemail"

    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr: True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca
E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
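As a quick offline check of the scheme, the two encoded values from the HTML snippet at the top can be decoded without fetching the page. This is only a minimal sketch: cf_decode and split_mailto are hypothetical helper names (not part of the script above or of any library), and decoding the bytes as UTF-8 is an assumption that happens to hold for these ASCII samples. The second helper also separates the address from the ?subject=... part that shows up in the <a> tag results.

```python
from urllib.parse import unquote


def cf_decode(encoded):
    """Decode a hex string whose first byte is the XOR key for the rest."""
    raw = bytes.fromhex(encoded)
    key, payload = raw[0], raw[1:]
    # Assumption: the decoded bytes are UTF-8 text.
    return bytes(b ^ key for b in payload).decode("utf-8")


def split_mailto(decoded):
    """Split 'address?query' into (address, query), percent-decoding both."""
    address, _, query = decoded.partition("?")
    return unquote(address), unquote(query)


# Value of the data-cfemail attribute from the snippet above.
span_value = "70191e161f3003191502021114151e04111c5e1311"

# Part of the href that follows "/cdn-cgi/l/email-protection#".
href_value = (
    "3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d49"
    "4d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59"
)

print(cf_decode(span_value))
print(split_mailto(cf_decode(href_value)))
```

Both values should decode to the same address, info@sierradental.ca, which matches the first entries in the output above; the href additionally carries the subject line.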