I need help with a problem on a real estate website: the site returns an HTTP 429 error on my first visit or scrape through a fresh IP/proxy. I believe this makes my question different from existing posts, which all address 429 errors by slowing down requests or adding a proxy. To illustrate, here is the page visited for the first time on a fresh IP in Chrome incognito with DevTools open:

Link to devtool image

The image shows Chrome's Network panel on a fresh IP. The original request (1-adam-ct-gladstone) received a 429 error, followed by a series of further requests, until the intended document response finally arrived. The bottom line is that, despite the error, Chrome goes on to load the page successfully anyway. With scrapy-playwright, the spider receives the 429 error and sends requests up to edge.fullstory.com, but stops there and so fails to scrape the page. I'm relatively new to web scraping. Why doesn't it continue the series of requests up to the actual document response, and what direction should I look in to get Scrapy to return the response I need?

Main code:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.utils.reactor import install_reactor
from scrapeRealestate.spiders.realestateSpider import RealEstateSpider

# Install the asyncio reactor before twisted.internet.reactor is imported;
# scrapy-playwright requires it.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
from twisted.internet import reactor


def main():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(RealEstateSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    main()

Spider code:

import scrapy
import os


class RealEstateSpider(scrapy.Spider):
    name = 'realestate'

    def start_requests(self):
        save_path = '/mnt/e/Python/Scrapy/RealEstate'

        url = 'https://www.realestate.com.au/property/1-adam-ct-gladstone-park-vic-3043'
        yield scrapy.Request(
            url=url,
            cb_kwargs={'save_path': save_path},
            meta={'playwright': True}
        )

    def parse(self, response, save_path):
        filename = os.path.join(save_path, 'testid' + '.html')
        with open(filename, 'wb') as html_file:
            html_file.write(response.body)
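
One direction to explore, sketched under assumptions: the log below shows Scrapy's RetryMiddleware retrying the 429 on a new page each time, and HttpErrorMiddleware then dropping the response before `parse()` runs. If the 429 document is a JavaScript challenge that reloads the page once it passes (which the DevTools capture suggests, but which is not confirmed), the request meta could let the 429 through and give the browser time to finish. The helper name `challenge_tolerant_meta` is hypothetical; the meta keys are standard Scrapy and scrapy-playwright keys.

```python
def challenge_tolerant_meta(page_methods=None):
    """Build Request.meta that lets a 429 challenge response reach the callback.

    Assumption: the first 429 is a browser challenge, not a real rate limit.
    """
    meta = {
        # Render the request in the browser via scrapy-playwright.
        "playwright": True,
        # Keep RetryMiddleware from re-issuing the request on 429; in the log,
        # every retry opens a new page and fetches a new challenge script.
        "dont_retry": True,
        # Keep HttpErrorMiddleware from discarding the 429 before parse().
        "handle_httpstatus_list": [429],
    }
    if page_methods:
        # scrapy-playwright applies these after navigation, before it
        # captures the page content that becomes response.body.
        meta["playwright_page_methods"] = list(page_methods)
    return meta
```

In the spider this could be passed as `meta=challenge_tolerant_meta([PageMethod("wait_for_selector", "h1")])`, with `PageMethod` imported from `scrapy_playwright.page`. The `h1` selector is a placeholder for an element that exists only on the real listing page. Note that `response.status` would still be 429, and that the challenge may not pass in a headless browser at all.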

Log:

INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'BOT_NAME': 'scrapeRealestate',
 'NEWSPIDER_MODULE': 'scrapeRealestate.spiders',
 'SPIDER_MODULES': ['scrapeRealestate.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-05-17 18:22:30 [scrapy.core.engine] INFO: Spider opened
2022-05-17 18:22:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-17 18:22:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-17 18:22:30 [scrapy-playwright] INFO: Launching browser
2022-05-17 18:22:30 [scrapy-playwright] INFO: Launching browser
2022-05-17 18:22:31 [scrapy-playwright] INFO: Browser chromium launched
2022-05-17 18:22:31 [scrapy-playwright] INFO: Browser chromium launched
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: Browser context started: 'default'
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=025xiKQqTztJDGchBXr3fgDICO9Sgo7CxWdkVJVp9ASfql68djNEXKHvfAyeqUXraZ98H2JC4ehP0SK91IzXKCyIQdZ1lqtKHlErALzHJOeJOmXxYryDhPNFrs4yxgaFHQgMecNX9su5xMlLfg3X9LvrPrO> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=025xiKQqTztJDGchBXr3fgDICO9Sgo7CxWdkVJVp9ASfql68djNEXKHvfAyeqUXraZ98H2JC4ehP0SK91IzXKCyIQdZ1lqtKHlErALzHJOeJOmXxYryDhPNFrs4yxgaFHQgMecNX9su5xMlLfg3X9LvrPrO> (referrer: None)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:39 [filelock] DEBUG: Attempting to acquire lock 140184252631648 on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Lock 140184252631648 acquired on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Attempting to release lock 140184252631648 on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Lock 140184252631648 released on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 1 times): 429 Unknown Status
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=03vDWxNWetBJ7uHgYKHg19iDTjBtyNRvKxUSlzejF97Sv82yaow30Yl5rreuRO5xJ4djIEjbVsCdJ1ilXgXzGBvpUhY2GjdHpy2xXqhomRzcstmqnRFXM5sYpN97I1pdJo3OS6fzoP20GHORAekIeLKpyIf> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=03vDWxNWetBJ7uHgYKHg19iDTjBtyNRvKxUSlzejF97Sv82yaow30Yl5rreuRO5xJ4djIEjbVsCdJ1ilXgXzGBvpUhY2GjdHpy2xXqhomRzcstmqnRFXM5sYpN97I1pdJo3OS6fzoP20GHORAekIeLKpyIf> (referrer: None)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:42 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 2 times): 429 Unknown Status
2022-05-17 18:22:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=027f2hIYg3c7BebFEAXNtGvdejxWv86IYBfmix4CvvUrQwTKS7YWFRE2gzR9DOQgGq5d35chrWRUMLiq70rl4bZ2kEd9957mFPXRqV19BGewQKxTmgiA4TrLNP04hbQUYnLvrVYFZWAcfXipASJcnQQH9NP> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=027f2hIYg3c7BebFEAXNtGvdejxWv86IYBfmix4CvvUrQwTKS7YWFRE2gzR9DOQgGq5d35chrWRUMLiq70rl4bZ2kEd9957mFPXRqV19BGewQKxTmgiA4TrLNP04hbQUYnLvrVYFZWAcfXipASJcnQQH9NP> (referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:45 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:45 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:46 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 3 times): 429 Unknown Status
2022-05-17 18:22:46 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referer: https://www.realestate.com.au/) ['playwright']
2022-05-17 18:22:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043>: HTTP status code is not handled or not allowed
2022-05-17 18:22:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-17 18:22:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1364,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 4478,
 'downloader/response_count': 3,
 'downloader/response_status_count/429': 3,
 'elapsed_time_seconds': 15.783981,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 5, 17, 10, 22, 46, 437676),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/429': 1,
 'log_count/DEBUG': 38,
 'log_count/ERROR': 1,
 'log_count/INFO': 15,
 'memusage/max': 60256256,
 'memusage/startup': 60256256,
 'playwright/context_count': 1,
 'playwright/page_count': 3,
 'playwright/page_count/closed': 3,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 15,
 'playwright/request_count/method/GET': 9,
 'playwright/request_count/method/POST': 6,
 'playwright/request_count/navigation': 3,
 'playwright/request_count/resource_type/document': 3,
 'playwright/request_count/resource_type/script': 3,
 'playwright/request_count/resource_type/xhr': 9,
 'playwright/response_count': 12,
 'playwright/response_count/method/GET': 6,
 'playwright/response_count/method/POST': 6,
 'playwright/response_count/resource_type/document': 3,
 'playwright/response_count/resource_type/script': 3,
 'playwright/response_count/resource_type/xhr': 6,
 'response_received_count': 1,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/429 Unknown Status': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2022, 5, 17, 10, 22, 30, 653695)}
2022-05-17 18:22:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-05-17 18:22:46 [scrapy-playwright] INFO: Closing browser
2022-05-17 18:22:46 [scrapy-playwright] INFO: Closing browser
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: Browser context closed: 'default'

Process finished with exit code 0
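
The stats above report 15 Playwright requests but only 12 responses. A small sketch like the following could show which requests were still unanswered when each page was closed. The regular expression is an assumption based on the scrapy-playwright log lines shown in this log, and the function name `unanswered_requests` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.com/> (...)
# [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.com/> (...)
LINE_RE = re.compile(
    r"\[scrapy-playwright\] DEBUG: \[Context=\S+\] (Request|Response): <(\S+) (\S+)>"
)


def unanswered_requests(log_lines):
    """Return a Counter of URLs whose Request lines outnumber their Response lines."""
    pending = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a scrapy-playwright request/response line
        kind, _method_or_status, url = match.groups()
        pending[url] += 1 if kind == "Request" else -1
    # Keep only URLs with more requests than responses.
    return Counter({url: n for url, n in pending.items() if n > 0})
```

Calling `unanswered_requests(open("scrapy.log"))` on a saved copy of the log (the file name is an assumption) should point at the `edge.fullstory.com` request, the last one logged before each retry.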

CTK
  • Added scrapy log – CTK May 17 '22 at 10:25
  • Logs aren't code. Just because you managed to scrape a site once doesn't mean you won't get blocked or banned if you try again too soon. The dev tools image shows that you're *really* receiving 429 errors – Panagiotis Kanavos May 17 '22 at 10:32
  • This isn't a novel problem in any way. Web sites hate web scraping because they're forced to pay to handle traffic for no reason. That can be *very* expensive. If you call the same site too fast you could get throttled or even blocked. Sometimes a 429 error will contain the limit information in rate limiting headers [as shown in this question](https://stackoverflow.com/questions/16022624/examples-of-http-api-rate-limiting-http-response-headers) – Panagiotis Kanavos May 17 '22 at 10:36
  • Please note that I've repeatedly stated that these results were obtained after switching to a new IP address, meaning the 429 error shows up even on the first attempt to access the website, whether via Chrome or via Scrapy. That's the novelty of my problem. If you open the webpage with the inspector for the first time, you'll get it too. – CTK May 17 '22 at 10:44
  • Added code as well – CTK May 17 '22 at 10:52
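
On the rate limiting headers mentioned in the comments: one standard header a 429 response may carry is `Retry-After`, which holds either a number of seconds or an HTTP date. The log above does not show whether this site sends it, so the following is only a sketch of how such a value could be read; the function name `retry_after_seconds` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the delay in seconds encoded by a Retry-After value, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Form 1: a non-negative integer number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # neither form; treat as absent
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (when - now).total_seconds())
```

In a Scrapy callback, the raw value could come from `response.headers.get("Retry-After")`, which returns bytes and would need decoding before being passed in.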

0 Answers