5

I am trying to build a scraper using selenium in python. Selenium webdriver opening window and trying to load the page but suddenly stop loading. I can access the same link in my local chrome browser.

Here are the error logs I'm getting from the webdriver:

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637}

{'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338}

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1556997748339}

My script:

from selenium import webdriver
import os

path = os.path.join(os.getcwd(), 'chromedriver')
driver = webdriver.Chrome(executable_path=path)

links = [
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/baby-accessories?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/food?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/formula?pageNumber=1",
]


for link in links:
    driver.get(link)
undetected Selenium
  • 183,867
  • 41
  • 278
  • 352
Atik R
  • 133
  • 1
  • 2
  • 11
  • 1
    Please add your Chrome, Chromedriver and Selenium versions. Do you see the first page loading or is the automated Chrome stuck at the first link too? Have you tried other domains (a server might have some throttling installed preventing you from requesting too quickly)? Try [gather Chromedriver logs](http://chromedriver.chromium.org/logging) and inspect them for errors. – try-catch-finally May 04 '19 at 06:42
  • Hi thanks for quick replay. My chrome version- 74, selenium version- 3.14.0 and webdriver version-74. Browser does not stucked at first link, its automating all links but no page loaded completley, its white blank window. – Atik R May 04 '19 at 07:17
  • I can see there are 4 URLs, how do you want to automate that ? – cruisepandey May 04 '19 at 07:46
  • I wnat to get all product links form provided URLs. @cruisepandey – Atik R May 04 '19 at 08:01
  • 1
    Yes, Other webpages are loading @try-catch-finally – Atik R May 04 '19 at 08:02
  • You need to update the question with the trace logs to know the reason and solution why _suddenly stop loading_. – undetected Selenium May 04 '19 at 16:09
  • I have added logs from the webdriver @DebanjanB – Atik R May 04 '19 at 19:28

1 Answers1

2

429 Too Many Requests

The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.

When a server is under attack or just receiving a very large number of requests from a single party, responding to each with a 429 status code will consume resources. Therefore, servers are not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections, or take other steps.


404 Not Found

The HTTP 404 Not Found client error response code indicates that the server can not find requested resource. In the browser, this means the URL is not recognized. In an API, this can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 to hide the existence of a resource from an unauthorized client. This response code is probably the most famous one due to its frequent occurence on the web.

A 404 status code does not indicate whether the resource is temporarily or permanently missing. But if a resource is permanently removed, a 410 (Gone) should be used instead of a 404 status. Additionally, 404 status code is used when the requested resource is not found, whether it doesn't exist or if there was a 401 or 403 that, for security reasons, the service wants to mask.


Analysis

When I tried your code block, I faced similar consequences. If you inspect the DOM Tree of the webpage you will find that quite a few tags are having the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
  • 'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

The presence of the term dist is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with **Selenium** was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a couple of detailed discussion in:

undetected Selenium
  • 183,867
  • 41
  • 278
  • 352
  • 1
    Understood, thanks for the brief information. So, this website is not scrapeable, right?@DebanjanB – Atik R May 06 '19 at 07:13