This is the website I am crawling. It worked without problems at first, but then this started showing up in the log:
[scrapy] DEBUG: Redirecting (meta refresh) to <GET https://www.propertyguru.com.my/distil_r_captcha.html?requestId=9f8ba25c-3673-40d3-bfe2-6e01460be915&httpReferrer=%2Fproperty-for-rent%2F1> from <GET https://www.propertyguru.com.my/property-for-rent/1>
The website detects that I am a bot and redirects me to a page with a captcha. I think handle_httpstatus_list or dont_redirect doesn't work because the redirection isn't done with HTTP status codes. Is there any way to stop this redirection? This is my crawler's code:
import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://www.propertyguru.com.my/property-for-rent/1',
    ]
    # Headers and meta sent with the listing requests built in parse().
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    meta = {
        'dont_redirect': True
    }

    def parse(self, response):
        # Collect the listing links from the search results page.
        items = response.css('div.header-container h3.ellipsis a.nav-link::attr(href)').getall()
        if items:
            for item in items:
                if item.startswith('/property-listing/'):
                    yield scrapy.Request(
                        url='https://www.propertyguru.com.my{}'.format(item),
                        method='GET',
                        headers=self.headers,
                        meta=self.meta,
                        callback=self.parse_items
                    )

    def parse_items(self, response):
        # Open the Scrapy shell to inspect the listing response.
        from scrapy.shell import inspect_response
        inspect_response(response, self)
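To illustrate what "meta refresh" in the log line means: the server answers with an ordinary HTML page containing a `<meta http-equiv="refresh">` tag, so there is no 3xx status for handle_httpstatus_list to catch. Below is a minimal standard-library sketch of how such a tag can be detected. The names `MetaRefreshFinder` and `find_meta_refresh` and the sample HTML are made up for illustration; this is not how Scrapy itself implements the check.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remembers the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/some/path".
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return the meta refresh target found in html, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target


# Hypothetical page body, modelled on the redirect target in the log above.
sample = (
    '<html><head><meta http-equiv="refresh" '
    'content="0; url=/distil_r_captcha.html?requestId=abc"></head></html>'
)
print(find_meta_refresh(sample))
```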
UPDATE: I also tried the following settings, but they didn't work either.
custom_settings = {
    'DOWNLOAD_DELAY': 5,
    'DOWNLOAD_TIMEOUT': 360,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    'CONCURRENT_ITEMS': 1,
    'REDIRECT_MAX_METAREFRESH_DELAY': 200,
    'REDIRECT_MAX_TIMES': 40,
}
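One more thing that may be worth trying, sketched here under assumptions: Scrapy follows meta refresh redirects through MetaRefreshMiddleware, which the built-in METAREFRESH_ENABLED setting can switch off. The names `EXTRA_SETTINGS`, `CAPTCHA_MARKER` and `looks_blocked` are hypothetical, and the marker string is simply taken from the log line above. This would only stop Scrapy from following the redirect; the site would still be serving the block page instead of the listings.

```python
# Settings fragment to merge into custom_settings.
# METAREFRESH_ENABLED is a built-in Scrapy setting (default True).
EXTRA_SETTINGS = {
    "METAREFRESH_ENABLED": False,
}

# Assumption: the block page can be recognised by this string, which
# appears in the redirect target shown in the log.
CAPTCHA_MARKER = "distil_r_captcha"


def looks_blocked(url, body_text):
    """Return True if the URL or the page body points at the captcha page."""
    return CAPTCHA_MARKER in url or CAPTCHA_MARKER in body_text


# In a spider callback this could be used roughly as:
#     if looks_blocked(response.url, response.text):
#         self.logger.warning("Blocked at %s", response.url)
#         return
```

Note also that the redirect in the log is for the start URL itself. Requests generated from start_urls do not use self.headers or self.meta, so that first request goes out with Scrapy's default user agent and without dont_redirect unless start_requests is overridden.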