I am trying to scrape a website, but I get a 403 Forbidden response (which means they have blocked me). How can I solve this problem?

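from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

# url: the website that I want to scrape (not given in the question)
driver.get(url)

html = driver.page_source  # HTML as rendered by the browser
soup = BeautifulSoup(html, 'lxml')
print(soup)
driver.quit()  # close the browser when done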

I got this error message:

<html><head><title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head>
<body style="margin:0">
<script async="" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&amp;ns=1&amp;cb=749975105" type="text/javascript"></script>
<script>var dd={'cid':'AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ==','hsh':'53505CB4534F4422CC81E4A9499234','t':'fe'}</script>
<script src="https://ct.datado.me/c.js"></script>
<iframe border="0" frameborder="0" height="100%" scrolling="yes" src="https://c.datado.me/captcha/?initialCid=AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ%3D%3D&amp;hash=53505CB4534F4422CC81E4A9499234&amp;cid=09ccOuPGIGlqdUvFNJgB7GzPDCFBmdMIU8Ng~E~1M6.&amp;t=fe " style="height:100vh;" width="100%"></iframe>
<script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true; _analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=749975105';
var _analytics_elem = document.getElementsByTagName('script')[0]; _analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script>
</body></html>
Adlan Kadri
  • Does this happen even if you don't use headless? Try removing the headless line and see if the problem persists. – Xosrov Jun 15 '19 at 17:03
  • Yes, I tried every case and every technology, and I still couldn't get the HTML page of the website that I want to scrape. – Adlan Kadri Jun 15 '19 at 17:10
  • Maybe they've blocked your IP. Try it with another network or a VPN and see if it repeats. – Xosrov Jun 15 '19 at 18:06
  • Do note that many of us (aka the people you're asking for help) are the folks who run web sites we don't want scraped. :) – Charles Duffy Jun 16 '19 at 00:14

1 Answer

403 Forbidden

The HTTP 403 Forbidden client error status response code indicates that the server has received and understood the request but refuses to authorize it: the client does not have access rights to the content.

This status is similar to 401, but in this case, re-authenticating will make no difference. The access is permanently forbidden and tied to the application logic, such as insufficient rights to a resource.


Example response

HTTP/1.1 403 Forbidden
Date: Sun, 16 Jun 2019 07:28:00 GMT
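
To make the difference between 401 and 403 concrete, here is a minimal sketch using only Python's standard `http.HTTPStatus`. The helper names `describe_status` and `retry_with_credentials_may_help` are made up for this illustration and are not part of any library.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '403 Forbidden' for a numeric status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def retry_with_credentials_may_help(code: int) -> bool:
    """401 invites the client to authenticate; 403 is a refusal regardless of identity."""
    return code == HTTPStatus.UNAUTHORIZED


for code in (401, 403):
    print(describe_status(code),
          "-> retry with credentials:", retry_with_credentials_may_help(code))
```

In other words, a scraper that receives 403 gains nothing by simply logging in again or resending the same request.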

Reason

There are many ways for a headless Chrome browser to be detected. Some of the main factors include:

  • User agent
  • Plugins
  • Languages
  • WebGL
  • Browser features
  • Missing image
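
As a rough sketch of how the first few factors above (user agent, languages, window size) can be set explicitly, the snippet below builds Chrome command-line switches and passes them through `ChromeOptions.add_argument`. The user-agent string, the helper names `build_chrome_arguments` and `make_driver`, and the default values are assumptions chosen for illustration. Nothing here is guaranteed to get past a bot-protection service, which may check many other signals.

```python
from typing import List

# Hypothetical example value; use a string that matches a real, current browser.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
)


def build_chrome_arguments(user_agent: str, language: str = "en-US",
                           width: int = 1920, height: int = 1080) -> List[str]:
    """Build Chrome switches that override the user agent, UI language and window size."""
    return [
        f"--user-agent={user_agent}",
        f"--lang={language}",
        f"--window-size={width},{height}",
    ]


def make_driver(arguments: List[str]):
    """Start Chrome with the given switches (requires selenium and chromedriver)."""
    from selenium import webdriver  # imported lazily so the helper above works without it

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

A call such as `make_driver(build_chrome_arguments(EXAMPLE_USER_AGENT))` would then replace the `webdriver.Chrome(options=options)` line in the question.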

You can find a detailed discussion in "Selenium and non-headless browser keeps asking for Captcha".


Solution

A generic solution is to route the browser through a proxy, or to rotate through several proxies, for example ones taken from the Free Proxy List.

You can find a detailed discussion in "Change proxy in chromedriver for scraping purposes".
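
A minimal sketch of the rotating-proxy idea, assuming you already have a list of working `host:port` proxies. The addresses below are placeholders from a documentation-only IP range, and `proxy_arguments` and `make_driver_with_proxy` are hypothetical helper names. Chrome's `--proxy-server` switch is set when the browser starts, so each new proxy needs a new driver.

```python
from itertools import cycle
from typing import Iterable, Iterator

# Placeholder addresses (TEST-NET-3 range); replace with proxies you are allowed to use.
PLACEHOLDER_PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]


def proxy_arguments(proxies: Iterable[str]) -> Iterator[str]:
    """Yield a --proxy-server switch for each proxy in turn, repeating endlessly."""
    for proxy in cycle(list(proxies)):
        yield f"--proxy-server=http://{proxy}"


def make_driver_with_proxy(proxy_argument: str):
    """Start Chrome behind one proxy (requires selenium and chromedriver)."""
    from selenium import webdriver  # imported lazily so the generator works without it

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument)
    return webdriver.Chrome(options=options)
```

Typical use would be `rotation = proxy_arguments(PLACEHOLDER_PROXIES)` followed by `driver = make_driver_with_proxy(next(rotation))` for each batch of requests, quitting the old driver before starting the next one.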

undetected Selenium