After searching for several hours on Stack Overflow and other pages, I haven't been able to find a solution to my problem. I would like to scrape the page https://www.bstn.com/eu_de using Python, Selenium and ChromeDriver.

When I visit the page with a normal browser like Firefox or Chrome, it opens without any issues. However, when I use Selenium, I get a blank white page back. My script already includes the standard settings suggested on Stack Overflow hundreds of times:

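from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])  # drop the automation switch
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag from page scripts
driver = webdriver.Chrome(options=options)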

I also rotate the user agent on every request.

Further investigation showed that the server responds with a 429 error. Normally 429 means that too many requests have been made, but since I've tried fewer than 10 times and the page still works in normal browsers, that doesn't seem to be the problem.

Another look at Chrome's Network -> Headers tab shows that the server returning the 429 error is Cloudflare, so Cloudflare seems to be involved somehow. I've compared the request headers of a successful connection (right in the picture) with those of a connection that got the 429 error (left). (Image: headers comparison)

The only differences are a slightly larger cookie set (all cookies were deleted before the requests were made), a referer header, the sec-fetch-site value being same-origin, and sec-fetch-user: ?1. Adding or changing these headers with a tool called Selenium Wire doesn't affect the problem in any way.
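A comparison like the one above can also be scripted rather than done by eye. The following is a minimal sketch, assuming the two header sets have already been copied out of the Network tab into plain dictionaries. The function name `diff_headers` and the header values are made-up placeholders for illustration, not captured traffic.

```python
def diff_headers(ok_headers, blocked_headers):
    """Return the headers whose values differ between two requests.

    Header names are compared case-insensitively. Each result maps the
    lowercased name to a (value_in_ok, value_in_blocked) tuple, with None
    standing for a header that is absent on that side.
    """
    ok = {name.lower(): value for name, value in ok_headers.items()}
    blocked = {name.lower(): value for name, value in blocked_headers.items()}
    diff = {}
    for name in sorted(set(ok) | set(blocked)):
        if ok.get(name) != blocked.get(name):
            diff[name] = (ok.get(name), blocked.get(name))
    return diff


# Hypothetical values, for illustration only
browser_headers = {
    "Accept": "text/html",
    "Referer": "https://www.bstn.com/eu_de",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
}
selenium_headers = {
    "Accept": "text/html",
    "Sec-Fetch-Site": "none",
}

header_diff = diff_headers(browser_headers, selenium_headers)
for name, (in_browser, in_selenium) in header_diff.items():
    print(f"{name}: browser={in_browser!r} selenium={in_selenium!r}")
```

This only shows which headers differ. It says nothing about whether the server actually uses any of them to decide on the 429.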

I could also identify a request cookie, "name":"KP_REF","domain":"www.bstn.com","value":"", that is created in the normal browser but doesn't exist when using Selenium. Adding:

driver.add_cookie({"name":"KP_REF","domain":"www.bstn.com","value":""})

also doesn't change anything.

What am I missing or doing wrong that prevents me from accessing this page? I'm not using headless Chrome so far, and I depend on ChromeDriver, as it is the standard inside our application. I also need to stick with ChromeDriver because ChromeDriverManager doesn't seem to work with undetected-ChromeDriver.

Gordian

1 Answer

HTTP 429 Error

HTTP 429 Error is returned when a user has sent too many requests within a short period of time. The 429 status code is intended for use with rate-limiting schemes.

In real-world use cases, if the AUT (Application Under Test) detects that a user agent is trying to access a specific page too often in a short period of time, it triggers a rate-limiting feature. The most common example is an attacker repeatedly trying to log into your site.

However, the application server may also identify users with cookies, rather than by their login credentials. Requests may also be counted on a per-request basis, across your server, or across several servers. So there are a variety of situations that can result in you seeing an error like one of these:

  • 429 Too Many Requests
  • 429 Error
  • HTTP 429
  • Error 429 (Too Many Requests)
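To make the counting idea concrete, here is a minimal sketch of a fixed-window rate limiter keyed by an arbitrary client identifier (a cookie value, a login, or an IP address). It is an illustration only: the class name `FixedWindowLimiter`, the limit, and the key are assumptions, and it says nothing about how any particular site or CDN actually counts requests.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per client key in each time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable so the example can fake time
        self._windows = {}          # client key -> (window_start, request_count)

    def check(self, client_key):
        """Record one request and return the HTTP status the server would send."""
        now = self.clock()
        start, count = self._windows.get(client_key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        count += 1
        self._windows[client_key] = (start, count)
        return 200 if count <= self.limit else 429


# Usage with a fake clock: the key is a hypothetical session cookie value.
fake_now = [0.0]
limiter = FixedWindowLimiter(limit=3, window_seconds=60, clock=lambda: fake_now[0])
statuses = [limiter.check("session-cookie-abc") for _ in range(4)]
fake_now[0] = 61.0  # move past the end of the first window
status_after_window = limiter.check("session-cookie-abc")
print(statuses, status_after_window)
```

Because the counter is kept per key, two clients with different cookies are limited independently, which is why the choice of key matters as much as the limit itself.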

Examples

A couple of examples:

HTTP/1.1 429 Too Many Requests
Content-type: text/html
Retry-After: 3600

and

<html>
    <head>
        <title>Too Many Requests</title>
    </head>
    <body>
        <h1>Too Many Requests</h1>
        <p>I only allow 100 requests per hour to this website per logged in user. Try again soon. </p>
    </body>
</html>
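The Retry-After header in the first example tells the client how long to wait. The header may carry either a number of seconds or an HTTP date, so a client has to handle both forms. The sketch below uses only the standard library. The function name `retry_after_seconds` is an assumption made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value into a number of seconds to wait.

    Returns None when the value is neither a delay in seconds nor a
    parseable HTTP date.
    """
    now = now or datetime.now(timezone.utc)
    value = header_value.strip()
    if value.isdigit():
        return int(value)  # delay-seconds form, e.g. "3600"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0, int((when - now).total_seconds()))


wait = retry_after_seconds("3600")
print(wait)
```

A server is not required to send Retry-After with a 429, so the caller should be prepared for the header to be missing altogether.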

Remediation

At times this problem goes away on its own. However, in some cases plugin issues or Distributed Denial of Service (DDoS) attacks can also cause this error; these need to be addressed individually and belong in a separate discussion altogether.
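When the 429 really is caused by rate limiting, waiting and retrying with increasing delays is a common client-side approach. The following is a minimal sketch under that assumption. `fetch_with_backoff` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return a status code together with the server's requested delay in seconds (or None if none was given).

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or attempts run out.

    `fetch` is a hypothetical callable returning (status_code, retry_after),
    where retry_after is a delay in seconds or None.
    """
    status = None
    for attempt in range(max_attempts):
        status, retry_after = fetch()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Prefer the server's hint; otherwise double the delay each time.
            delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
            sleep(delay)
    return status


# Usage with canned responses and a recording "sleep" so nothing actually waits.
responses = iter([(429, None), (429, 5), (200, None)])
waits = []
final_status = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(final_status, waits)
```

If the block is caused by bot detection rather than request volume, retrying will not help, and repeated attempts may make things worse.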


Outro

Finally, for clarity, note that the webpage is protected using hCaptcha.

undetected Selenium
  • Hi, I already have most of this information, as you posted a similar answer on another SO topic. However, as I explained in detail above, the reason for the 429 error in this case is not that the page is viewed or opened too often. https://www.bstn.com/eu_de opens flawlessly in non-automated browsers (i.e. ones used normally). In fact, in normal use I can make close to a hundred requests per day from a single IP without any problems. Nevertheless, as soon as I access it via Selenium I run into this problem. – Gordian Mar 25 '22 at 21:51
  • My intention with this post is to find out whether somebody has an idea, or at least a clue, about what is blocking my Python Selenium script, as this is a really weird problem. I've taken care to set up Selenium and ChromeDriver so that they won't be detected, as shown above. The user agent rotates every time a request is sent, so it's impossible that they are blocking me because of that. Did I miss anything in the page headers? The other weird thing is that the connection seems to be blocked by Cloudflare, yet Cloudflare returns a 429 error... – Gordian Mar 25 '22 at 21:56
  • Also, I get a blank white screen, not the typical Cloudflare access page, but when visiting the site via the web server's IP I get a Cloudflare connection-blocked page, so they are certainly using it. Can anyone reproduce this and maybe offer a solution? Thanks in advance! – Gordian Mar 25 '22 at 22:00
  • Check out the updated answer and let me know your thoughts. – undetected Selenium Mar 25 '22 at 22:06
  • Yes, the webpage may be protected by captchas, but I think that is not what this post / my question is about... – Gordian Mar 25 '22 at 23:06