
I am writing a Scrapy spider that crawls a dynamic website with the Selenium Chrome WebDriver, but recently I found that my spider is being blocked by the website. It only downloads one or two pages when I run it for testing and debugging. The printed page source is as follows:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=73bf85fe-a3e7-4dc8-9284-1605c4cd82f3&amp;httpReferrer=%2Fmyytavat-uudisasunnot%3FcardType%3D100%26locations%3D%255B%2522helsinki%2522%255D%26newDevelopment%3D1%26buildingType%255B%255D%3D1%26buildingType%255B%255D%3D256%26pagination%3D1" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/dstlsnm.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;}#dfdretxfwc{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>


<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 72px; ">The quick brown fox jumps over the lazy dog.</span></div></body></html>

It's weird that I am able to browse the website in a normal browser while my spider is blocked. The spider stays blocked for about fifteen minutes, and after that it is able to download the page source again.
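In case it helps reproduce the behaviour: for now I detect the block page and back off before retrying, roughly like below. This is only a minimal sketch under my own assumptions; the marker strings come from the page source above, and the fifteen-minute wait is just the recovery time I observed, not anything documented by Distil.

import time

DISTIL_MARKERS = ('distil_r_captcha', 'distilIdentificationBlock')

def is_blocked(page_source):
    # The block page embeds Distil-specific URLs/ids; normal result pages don't.
    return any(marker in page_source for marker in DISTIL_MARKERS)

def get_with_backoff(driver, url, wait_seconds=15 * 60):
    driver.get(url)
    if is_blocked(driver.page_source):
        # The block seems to lift after roughly fifteen minutes,
        # so wait once and retry before giving up.
        time.sleep(wait_seconds)
        driver.get(url)
    return driver.page_source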

What I tried was adding a user agent, like this:

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
self.driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)

but it doesn't seem to get around this issue. (I also don't quite see the point of it: since Selenium drives a real Chrome to fetch the page source, a normal 'User-Agent' header should already be sent by default.) Does anyone have a recommendation as to how the web server can recognize my spider even though it does not download a huge number of pages?
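For completeness, this is what I am experimenting with now. It is only a sketch and I have not confirmed that it gets past Distil; I am assuming the detection fingerprints the automated browser itself (for example through the navigator.webdriver flag or ChromeDriver's automation extension) rather than the User-Agent header.

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
# Drop the "Chrome is being controlled by automated test software" switch
# and the automation extension, two things page scripts can fingerprint.
chromeOptions.add_experimental_option('excludeSwitches', ['enable-automation'])
chromeOptions.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)

# Inspect what a detection script would see; in an automated Chrome this
# tends to differ from a normal browser session.
print(driver.execute_script('return navigator.webdriver'))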

Jimmy
  • How did you find out the spider is blocked? Any relevant part of log? – Tomáš Linhart Jul 12 '17 at 15:38
  • when I printed the driver.page_source, it becomes the code I pasted above. Normally it should be a list of items. – Jimmy Jul 12 '17 at 16:13
  • Didn't notice the Distil block in the page source, sorry. Check out [this](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver#comment69663876_41220267) SO question, might be of help to you. – Tomáš Linhart Jul 12 '17 at 17:37
  • did you manage to fix this issue? – Hirad Roshandel Oct 16 '17 at 16:37
  • Not completely, not even with the recompiled Chromium, since I think I am already on the blacklist. – Jimmy Oct 16 '17 at 16:45
  • Possible duplicate of [Can a website detect when you are using selenium with chromedriver?](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver) – Vic Seedoubleyew Jan 30 '18 at 21:34

0 Answers