
I am writing a Scrapy spider that crawls a dynamic website with the Selenium Chrome WebDriver, but recently I found that my spider is being blocked by the website. It only downloads one or two pages when I run it for testing and debugging. The printed page source is as follows:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=73bf85fe-a3e7-4dc8-9284-1605c4cd82f3&amp;httpReferrer=%2Fmyytavat-uudisasunnot%3FcardType%3D100%26locations%3D%255B%2522helsinki%2522%255D%26newDevelopment%3D1%26buildingType%255B%255D%3D1%26buildingType%255B%255D%3D256%26pagination%3D1" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/dstlsnm.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;}#dfdretxfwc{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>


<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 72px; ">The quick brown fox jumps over the lazy dog.</span></div></body></html>

It's weird that I am able to browse the website in a normal browser while my spider is blocked. The spider stays blocked for about fifteen minutes, and after that it is able to download the page source again.
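In case it helps reproduce the behaviour: for now I detect the block page and back off before retrying, roughly like below. This is only a minimal sketch under my own assumptions; the marker strings come from the page source above, and the fifteen-minute wait is just the recovery time I observed, not anything documented by Distil.

import time

DISTIL_MARKERS = ('distil_r_captcha', 'distilIdentificationBlock')

def is_blocked(page_source):
    # The block page embeds Distil-specific URLs/ids; normal result pages don't.
    return any(marker in page_source for marker in DISTIL_MARKERS)

def get_with_backoff(driver, url, wait_seconds=15 * 60):
    driver.get(url)
    if is_blocked(driver.page_source):
        # The block seems to lift after roughly fifteen minutes,
        # so wait once and retry before giving up.
        time.sleep(wait_seconds)
        driver.get(url)
    return driver.page_source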

What I tried was adding a user agent, like this:

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
self.driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)

but it doesn't seem to get around this issue. (I also don't quite see the point of it: since Selenium drives a real Chrome to fetch the page source, a normal 'User-Agent' header should already be sent by default.) Does anyone have a recommendation as to how the web server can recognize my spider even though it does not download a huge number of pages?
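For completeness, this is what I am experimenting with now. It is only a sketch and I have not confirmed that it gets past Distil; I am assuming the detection fingerprints the automated browser itself (for example through the navigator.webdriver flag or ChromeDriver's automation extension) rather than the User-Agent header.

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
# Drop the "Chrome is being controlled by automated test software" switch
# and the automation extension, two things page scripts can fingerprint.
chromeOptions.add_experimental_option('excludeSwitches', ['enable-automation'])
chromeOptions.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)

# Inspect what a detection script would see; in an automated Chrome this
# tends to differ from a normal browser session.
print(driver.execute_script('return navigator.webdriver'))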

Jimmy
  • How did you find out the spider is blocked? Any relevant part of log? – Tomáš Linhart Jul 12 '17 at 15:38
  • when I printed the driver.page_source, it becomes the code I pasted above. Normally it should be a list of items. – Jimmy Jul 12 '17 at 16:13
  • Didn't notice the Distil block in the page source, sorry. Check out [this](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver#comment69663876_41220267) SO question, might be of help to you. – Tomáš Linhart Jul 12 '17 at 17:37
  • did you manage to fix this issue? – Hirad Roshandel Oct 16 '17 at 16:37
  • Not completely, not even with the recompiled Chromium, since I think I am already on the blacklist. – Jimmy Oct 16 '17 at 16:45
  • Possible duplicate of [Can a website detect when you are using selenium with chromedriver?](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver) – Vic Seedoubleyew Jan 30 '18 at 21:34

0 Answers