I need help with a problem on a real estate website: the site returns an HTTP 429 error on my first visit or scrape through a fresh IP/proxy. That makes this different from other posts, which all address 429 errors by slowing down requests or adding a proxy. To illustrate, I visited the page for the first time on a fresh IP in Chrome incognito with DevTools open:
The image shows the Network panel of Chrome DevTools on a fresh IP. The original request (1-adam-ct-gladstone) received a 429, followed by a series of further requests, until the intended document response finally arrived. The bottom line is that, despite the error, Chrome goes on to load the page successfully. With scrapy-playwright, the spider also receives the 429 and sends requests up to edge.fullstory.com, but it stops there and so fails to scrape the page. I'm relatively new to web scraping. Why doesn't it continue the series of requests through to the actual document response, and which direction should I look in to get Scrapy to return the response I need?
Main code:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapeRealestate.spiders.realestateSpider import RealEstateSpider
from scrapy.utils.log import configure_logging
from scrapy.utils.reactor import install_reactor

# Install the asyncio reactor before importing twisted.internet.reactor,
# otherwise Twisted installs its default reactor first.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
from twisted.internet import reactor


def main():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(RealEstateSpider)
    # Stop the reactor whether the crawl succeeds or fails.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    main()
Spider code:
import scrapy
import os


class RealEstateSpider(scrapy.Spider):
    name = 'realestate'

    def start_requests(self):
        save_path = '/mnt/e/Python/Scrapy/RealEstate'
        url = 'https://www.realestate.com.au/property/1-adam-ct-gladstone-park-vic-3043'
        yield scrapy.Request(
            url=url,
            cb_kwargs={'save_path': save_path},
            meta={'playwright': True},  # render this request with scrapy-playwright
        )

    def parse(self, response, save_path):
        # Save the response body so the returned HTML can be inspected.
        filename = os.path.join(save_path, 'testid' + '.html')
        with open(filename, 'wb') as html_file:
            html_file.write(response.body)
Log:
INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 1,
'BOT_NAME': 'scrapeRealestate',
'NEWSPIDER_MODULE': 'scrapeRealestate.spiders',
'SPIDER_MODULES': ['scrapeRealestate.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-05-17 18:22:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-05-17 18:22:30 [scrapy.core.engine] INFO: Spider opened
2022-05-17 18:22:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-17 18:22:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-17 18:22:30 [scrapy-playwright] INFO: Launching browser
2022-05-17 18:22:30 [scrapy-playwright] INFO: Launching browser
2022-05-17 18:22:31 [scrapy-playwright] INFO: Browser chromium launched
2022-05-17 18:22:31 [scrapy-playwright] INFO: Browser chromium launched
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: Browser context started: 'default'
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:37 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=025xiKQqTztJDGchBXr3fgDICO9Sgo7CxWdkVJVp9ASfql68djNEXKHvfAyeqUXraZ98H2JC4ehP0SK91IzXKCyIQdZ1lqtKHlErALzHJOeJOmXxYryDhPNFrs4yxgaFHQgMecNX9su5xMlLfg3X9LvrPrO> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=025xiKQqTztJDGchBXr3fgDICO9Sgo7CxWdkVJVp9ASfql68djNEXKHvfAyeqUXraZ98H2JC4ehP0SK91IzXKCyIQdZ1lqtKHlErALzHJOeJOmXxYryDhPNFrs4yxgaFHQgMecNX9su5xMlLfg3X9LvrPrO> (referrer: None)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:38 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:39 [filelock] DEBUG: Attempting to acquire lock 140184252631648 on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Lock 140184252631648 acquired on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Attempting to release lock 140184252631648 on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [filelock] DEBUG: Lock 140184252631648 released on /home/tk/.cache/python-tldextract/3.9.5.final__webscrape__b287f6__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-17 18:22:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 1 times): 429 Unknown Status
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:40 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=03vDWxNWetBJ7uHgYKHg19iDTjBtyNRvKxUSlzejF97Sv82yaow30Yl5rreuRO5xJ4djIEjbVsCdJ1ilXgXzGBvpUhY2GjdHpy2xXqhomRzcstmqnRFXM5sYpN97I1pdJo3OS6fzoP20GHORAekIeLKpyIf> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=03vDWxNWetBJ7uHgYKHg19iDTjBtyNRvKxUSlzejF97Sv82yaow30Yl5rreuRO5xJ4djIEjbVsCdJ1ilXgXzGBvpUhY2GjdHpy2xXqhomRzcstmqnRFXM5sYpN97I1pdJo3OS6fzoP20GHORAekIeLKpyIf> (referrer: None)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:41 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:42 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:42 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 2 times): 429 Unknown Status
2022-05-17 18:22:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-17 18:22:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (resource type: document, referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Response: <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=027f2hIYg3c7BebFEAXNtGvdejxWv86IYBfmix4CvvUrQwTKS7YWFRE2gzR9DOQgGq5d35chrWRUMLiq70rl4bZ2kEd9957mFPXRqV19BGewQKxTmgiA4TrLNP04hbQUYnLvrVYFZWAcfXipASJcnQQH9NP> (resource type: script, referrer: https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.realestate.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP2_UIDz=027f2hIYg3c7BebFEAXNtGvdejxWv86IYBfmix4CvvUrQwTKS7YWFRE2gzR9DOQgGq5d35chrWRUMLiq70rl4bZ2kEd9957mFPXRqV19BGewQKxTmgiA4TrLNP04hbQUYnLvrVYFZWAcfXipASJcnQQH9NP> (referrer: None)
2022-05-17 18:22:44 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:45 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:45 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://reporting.cdndex.io/error> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://reporting.cdndex.io/error> (referrer: None)
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://edge.fullstory.com/> (resource type: xhr, referrer: https://www.realestate.com.au/)
2022-05-17 18:22:46 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (failed 3 times): 429 Unknown Status
2022-05-17 18:22:46 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043> (referer: https://www.realestate.com.au/) ['playwright']
2022-05-17 18:22:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://www.realestate.com.au/property/1-adelaide-bvd-gowanbrae-vic-3043>: HTTP status code is not handled or not allowed
2022-05-17 18:22:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-17 18:22:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1364,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 4478,
'downloader/response_count': 3,
'downloader/response_status_count/429': 3,
'elapsed_time_seconds': 15.783981,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 5, 17, 10, 22, 46, 437676),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/429': 1,
'log_count/DEBUG': 38,
'log_count/ERROR': 1,
'log_count/INFO': 15,
'memusage/max': 60256256,
'memusage/startup': 60256256,
'playwright/context_count': 1,
'playwright/page_count': 3,
'playwright/page_count/closed': 3,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 15,
'playwright/request_count/method/GET': 9,
'playwright/request_count/method/POST': 6,
'playwright/request_count/navigation': 3,
'playwright/request_count/resource_type/document': 3,
'playwright/request_count/resource_type/script': 3,
'playwright/request_count/resource_type/xhr': 9,
'playwright/response_count': 12,
'playwright/response_count/method/GET': 6,
'playwright/response_count/method/POST': 6,
'playwright/response_count/resource_type/document': 3,
'playwright/response_count/resource_type/script': 3,
'playwright/response_count/resource_type/xhr': 6,
'response_received_count': 1,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/429 Unknown Status': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 5, 17, 10, 22, 30, 653695)}
2022-05-17 18:22:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-05-17 18:22:46 [scrapy-playwright] INFO: Closing browser
2022-05-17 18:22:46 [scrapy-playwright] INFO: Closing browser
2022-05-17 18:22:46 [scrapy-playwright] DEBUG: Browser context closed: 'default'
Process finished with exit code 0
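To check where each attempt stalls, the scrapy-playwright DEBUG lines in the log can be paired up. The sketch below uses only the standard library and assumes the log line format shown above. The names `LINE_RE` and `unanswered_requests` are hypothetical helpers, not part of Scrapy or scrapy-playwright.

```python
import re
from collections import Counter

# Matches scrapy-playwright DEBUG lines such as
#   "... Request: <GET https://edge.fullstory.com/> (resource type: xhr, ...)"
#   "... Response: <200 https://reporting.cdndex.io/error> (referrer: None)"
LINE_RE = re.compile(
    r"\[scrapy-playwright\] DEBUG: .*?(Request|Response): <(\S+) (\S+)>"
)


def unanswered_requests(log_lines):
    """Return {url: count} for requests that have no logged response."""
    pending = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a scrapy-playwright request/response line
        kind, _method_or_status, url = match.groups()
        if kind == "Request":
            pending[url] += 1
        else:
            pending[url] -= 1
    return {url: count for url, count in pending.items() if count > 0}
```

Run over the log above, this should report only https://edge.fullstory.com/, once per attempt, which is consistent with the stats showing 15 Playwright requests but 12 responses.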