
I tried to fetch the contents of this URL with Scrapy: https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/ Here is my code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

I opened the scraped file (test.html), but instead of the page it contained an error page (screenshot not reproduced here). I searched for solutions and tried the suggestions from "ERROR for site owner: Invalid domain for site key", but that didn't solve my issue.

1 Answer


First of all, try this approach and see if this works:

import scrapy

Headerz = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "pragma": "no-cache",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "cross-site",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
}

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]

    def start_requests(self):
        yield scrapy.Request(self.start_urls[0], callback=self.parse, headers=Headerz)

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

The reason we don't see the output we would see in a normal browser is that the request is missing the headers a browser always sends.

You need to add the headers either per request, as in the code above, or globally in settings.py.
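For the settings.py route, a minimal sketch using Scrapy's standard DEFAULT_REQUEST_HEADERS and USER_AGENT settings (header values copied from the code above; trim or extend as needed):

```python
# settings.py -- sketch: apply browser-like headers to every request.
# DEFAULT_REQUEST_HEADERS and USER_AGENT are standard Scrapy settings;
# Scrapy merges these into each outgoing request.

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
)

DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
}
```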

A better approach would be to use a 'rotating-proxies' repository along with a 'rotating-user-agent' repository.
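A sketch of what that might look like in settings.py, assuming the scrapy-rotating-proxies and scrapy-user-agents packages from PyPI (the middleware paths and priorities below follow those packages' documented defaults; the proxy addresses are placeholders):

```python
# settings.py -- sketch assuming `pip install scrapy-rotating-proxies scrapy-user-agents`.

# Proxies to rotate through (placeholder addresses -- substitute real ones).
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # From scrapy-rotating-proxies: rotates proxies and retires banned ones.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    # From scrapy-user-agents: picks a random browser User-Agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}
```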

Janib Soomro