Any option to bypass Incapsula protection in python3 while scraping?

Question

I'm new to scraping, and I'm already being blocked by Incapsula protection.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

# open the connection and download the page;
# the with-block closes the connection automatically
with uReq(my_url) as uClient:
    page_html = uClient.read()

# parse the HTML
page_soup = soup(page_html, "html.parser")

# look up the first <h1> element
page_soup.h1

I can't access any data from the website because Incapsula blocks my request.
When I run:

print(page_soup)

I get this message:

<html style="height:100%"><head><meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/><meta content="telephone=no" name="format-detection"/>
[...]
Request unsuccessful. Incapsula incident ID: 936002200207012991-
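
A quick way to tell whether a response is the block page rather than the real listing is to look for the marker text shown above. This is only a minimal sketch: the function name `looks_like_incapsula_block` is hypothetical, and it assumes the block page always contains the phrase "Incapsula incident ID", which is what this particular response shows but may not hold for every block page.

```python
def looks_like_incapsula_block(html):
    """Return True if the HTML contains the Incapsula block marker.

    Assumption: the block page includes the text "Incapsula incident ID",
    as in the response quoted in the question.
    """
    if isinstance(html, bytes):
        # urlopen(...).read() returns bytes, so decode before searching
        html = html.decode("utf-8", errors="replace")
    return "incapsula incident id" in html.lower()


blocked = "<html><body>Request unsuccessful. Incapsula incident ID: 123-</body></html>"
normal = "<html><body><h1>Listings</h1></body></html>"

print(looks_like_incapsula_block(blocked))
print(looks_like_incapsula_block(normal))
```

Calling it on `page_html` before parsing makes it clear whether an empty `page_soup.h1` comes from the block page or from the selector.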

Possible duplicate of [Why is my function 'page\_soup.h1' returning empty result?](https://stackoverflow.com/questions/56164404/why-is-my-function-page-soup-h1-returning-empty-result) — sentence, May 16 '19 at 13:33

Kafels · Accepted Answer · 2019-05-17T12:54:59.517

I ran the tests described in "Getting 'wrong' page source when calling url from python", and only @Karl Anka's workaround (loading the page with Selenium) worked.

See the example below:

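from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

# Selenium 4 takes the chromedriver path through a Service object;
# Selenium 3 used webdriver.Chrome(executable_path='./chromedriver').
driver = webdriver.Chrome(service=Service('./chromedriver'))
try:
    # load the page in a real browser instead of urllib
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, features='html.parser')
finally:
    # always close the browser, even if loading fails
    driver.quit()

print(soup.prettify())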

Output:

<html class="js flexbox rgba borderradius boxshadow opacity cssgradients csstransitions generatedcontent localstorage sessionstorage" style="" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <script async="" src="https://c.pebblemedia.be/js/data/david/_david_publishers_master_produpress.js" type="text/javascript">
  </script>
  <script async="" src="https://scdn.cxense.com/cx.js" type="text/javascript">
  </script>
  <script async="" src="https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.8.47">
  </script>
  <script async="" src="https://connect.facebook.net/signals/config/1554445828209863?v=2.8.47&amp;r=stable">
  </script>
[...]

Thank you for your help! I got the same output as you. Now I need to learn how to extract the information for each real estate property on the pages. — mr-kim, May 17 '19 at 00:47
For the moment, the best approach is to use Selenium and Fake_agent to bypass Incapsula. — mr-kim, May 17 '19 at 07:51
Also try using proxies along with Selenium. Incapsula also uses browser fingerprinting techniques. — Dhamodharan, May 20 '19 at 09:10
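
The two suggestions in the comments above (a custom user agent and a proxy) can both be passed to Chrome as command-line switches. The following is a rough sketch, not a guaranteed bypass: the helper names `build_chrome_arguments` and `make_driver`, the user-agent string, and the proxy address are placeholders chosen for illustration. It assumes Selenium can locate chromedriver on its own (on the PATH, or through Selenium Manager in recent Selenium 4 releases).

```python
def build_chrome_arguments(user_agent=None, proxy=None):
    """Build Chrome command-line switches for a user agent and a proxy."""
    arguments = []
    if user_agent:
        # --user-agent overrides the User-Agent header Chrome sends
        arguments.append(f"--user-agent={user_agent}")
    if proxy:
        # --proxy-server routes browser traffic through the given proxy
        arguments.append(f"--proxy-server={proxy}")
    return arguments


def make_driver(user_agent=None, proxy=None):
    """Start Chrome with the switches above (requires selenium and Chrome)."""
    # imported here so the argument builder can be used without selenium
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent, proxy):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Placeholder values; substitute a real user agent and a proxy you control.
print(build_chrome_arguments(
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",
    proxy="http://127.0.0.1:8080",
))
```

The driver returned by `make_driver(...)` can then be used exactly like `driver` in the accepted answer: call `driver.get(url)`, read `driver.page_source`, and finish with `driver.quit()`.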
