I am writing a Scrapy spider that crawls a dynamic website with the Selenium Chrome WebDriver, and recently the spider started being blocked by the website. My code only downloads one or two pages per run while I test and debug it. The printed page source is as follows:
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=73bf85fe-a3e7-4dc8-9284-1605c4cd82f3&httpReferrer=%2Fmyytavat-uudisasunnot%3FcardType%3D100%26locations%3D%255B%2522helsinki%2522%255D%26newDevelopment%3D1%26buildingType%255B%255D%3D1%26buildingType%255B%255D%3D256%26pagination%3D1" />
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/dstlsnm.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;}#dfdretxfwc{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 72px; ">The quick brown fox jumps over the lazy dog.</span></div></body></html>
What is weird is that I can still browse the site in a normal browser while the spider is blocked (judging by the redirect to /distil_r_captcha.html above, the response is a Distil bot-protection/CAPTCHA page). The block lasts about fifteen minutes, after which the spider is able to download the real page source again.
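For reference, the spider drives Chrome roughly like this (a trimmed-down sketch of my setup; the class name, driver path, and start URL are placeholders, and the real spider has more parsing and pagination logic):

import scrapy
from selenium import webdriver

class ListingsSpider(scrapy.Spider):
    name = 'listings'
    start_urls = ['https://www.example.com/listings?pagination=1']  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chromeOptions = webdriver.ChromeOptions()
        # the user-agent override discussed below is added here
        self.driver = webdriver.Chrome(executable_path='/path/to/chromedriver',
                                       chrome_options=chromeOptions)

    def parse(self, response):
        # render the page in the real browser and work from its page source
        self.driver.get(response.url)
        print(self.driver.page_source)  # this is where the block page above shows up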
What I tried was adding a user agent, like this:
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
self.driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)
but it doesn't seem to get around the issue (and I don't really see the point of it anyway: Selenium is driving a real Chrome, so a normal 'user-agent' should already be part of its default request headers). Any recommendation on how the web server can recognize my spider even though it does not download a huge number of pages?
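For what it's worth, one way to see what the page's scripts (such as the /dstlsnm.js it loads above) can read from the automated browser is to query a few navigator properties through the driver. This is just a quick stand-alone check with a placeholder driver path, not part of the spider:

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
driver = webdriver.Chrome(executable_path='/path/to/chromedriver',  # placeholder path
                          chrome_options=chromeOptions)
driver.get('https://www.example.com/')  # any page works for this check

# properties that bot-detection scripts commonly inspect
for check in ('navigator.userAgent',
              'navigator.webdriver',   # typically true when Chrome is driven by ChromeDriver
              'navigator.languages',
              'navigator.plugins.length'):
    print(check, '=', driver.execute_script('return ' + check + ';'))

driver.quit()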