Scraping data that is not in the page source using Python Scrapy

Question

I want to scrape the e-mail addresses from this page:

https://threebestrated.ca/children-dentists-in-airdrie-ab

but my spider returns nothing useful, because the addresses do not appear in the page source.

This is the code:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ["threebestrated.ca"]
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        emails = response.xpath("//a[contains(@href, 'mailto:')]/text()").getall()
        yield {
            "a": emails,
        }

Paul M. · Answer 1 · 2021-06-07T15:06:22.283

The e-mail addresses are obfuscated to prevent naive scraping (the /cdn-cgi/l/email-protection path and the __cf_email__ class suggest Cloudflare's e-mail protection). Here is one such encoded e-mail address:

<p>
    <a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
        <i class="fa fa-envelope-o"></i>
        <span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email&#160;protected]</span> 
   </a>
</p>

The browser then decodes this value with a JavaScript script, which is why the plain address never shows up in the page source.
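For illustration, the data-cfemail value in the snippet above seems to follow a simple scheme: the first two hex digits are a one-byte key, and every following byte is XORed with that key. Below is a minimal sketch under that assumption. The name `decode_cfemail` is my own and not part of any library, and decoding the result as UTF-8 is also an assumption (for plain ASCII addresses it makes no difference).

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string such as the one found in a data-cfemail attribute.

    Assumes the first byte is an XOR key for all remaining bytes.
    """
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    # XOR each payload byte with the key to recover the original text.
    return bytes(byte ^ key for byte in payload).decode("utf-8")


# The data-cfemail value from the snippet above.
print(decode_cfemail("70191e161f3003191502021114151e04111c5e1311"))
# prints: info@sierradental.ca
```

The same scheme appears to apply to the part of the href after the `#`, except that the decoded text there is percent-encoded and may carry a `?subject=...` query.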

So, your options are:

1. Reverse-engineer the decoding script.
2. Use some kind of JavaScript runtime to execute the decoding script.

If you're going to use a JavaScript runtime, you might as well use Selenium to begin with (there seems to be a scrapy-selenium middleware that you could use if you want to stick with Scrapy).

EDIT - I've reverse-engineered it for fun:

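def deobfuscate(string, start_index):
    """Yield the decoded characters of the hex string beginning at start_index."""

    def extract_hex(string, index):
        # Read two hex digits (one byte) at the given position.
        substring = string[index: index+2]
        return int(substring, 16)

    # The first byte is the XOR key; every following byte is XORed with it.
    key = extract_hex(string, start_index)
    for index in range(start_index+2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)


def process_tag(tag):
    # Decode the part of the href that follows the e-mail protection prefix.
    # Returns None for links that do not use the prefix.
    url_fragment = "/cdn-cgi/l/email-protection#"
    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None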

def main():

    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"

    response = requests.get(url)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    cf_elem_attr = "data-cfemail"

    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr:True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)
        

if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca

E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
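As the output shows, the addresses decoded from the <a> tags still carry a `?subject=...` suffix, and several addresses appear more than once. If only the bare addresses are needed, something along these lines might do. This is a sketch: the helper names `address_only` and `unique_addresses` are hypothetical, and the sample list is modelled on the output above rather than fetched live.

```python
from urllib.parse import unquote


def address_only(decoded: str) -> str:
    """Return the address part of a decoded mailto target, without the query."""
    address, _, _query = decoded.partition("?")
    return unquote(address).strip()


def unique_addresses(decoded_values):
    """Strip queries and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(address_only(value) for value in decoded_values))


# Sample values modelled on the output above.
sample = [
    "info@sierradental.ca?subject=Request%20through%20ThreeBestRated.ca",
    "info@sierradental.ca",
    "support@threebestrated.ca",
]
print(unique_addresses(sample))
# prints: ['info@sierradental.ca', 'support@threebestrated.ca']
```

To stay within Scrapy instead of requests and BeautifulSoup, the encoded values could presumably be collected inside `parse` with `response.css("[data-cfemail]::attr(data-cfemail)").getall()` and passed through `deobfuscate(value, 0)` before cleaning them up as above. I have not run this against the live site.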

Hi @Paul, How can we use Javascript runtime to execute decoding script? — Kashif Alamdar, Jun 07 '21 at 13:56
@KashifAlamdar Hi - I was going to suggest NodeJS. However, I've decided to reverse-engineer the script. I've edited my answer - take a look. — Paul M., Jun 07 '21 at 15:07
