I've been trying to scrape this page (https://www.riachuelo.com.br/feminino/colecao-feminino) with Selenium, but I can't access the HTML because the page never loads. I've tried random user agents and other browsers, but the problem persists. Any idea why this is happening?

Here is the code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
URL = "https://www.riachuelo.com.br/feminino/colecao-feminino"
options = Options()
ua = UserAgent()
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(options=options, executable_path=r"C:\Program Files (x86)\chromedriver.exe")
driver.get(URL)

Espuky

1 Answer

I ran your use case, loading the webpage at https://www.riachuelo.com.br/feminino/colecao-feminino with Selenium as follows:

from selenium import webdriver

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.riachuelo.com.br/feminino/colecao-feminino')

As you observed, I hit the same roadblock: the webpage never loads.


Analysis

While inspecting the DOM tree of the webpage, you will find that some of the <iframe> and <script> tags refer to the keyword dist. As an example:

  • src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html#!/?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&widget=true&top=40&text=Alguma%20d%C3%BAvida%3F&textcolor=ffffff&bgcolor=4E1D3A&from=bottomRigth"
  • <script id="dtbot-script" src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&amp;widget=true&amp;top=40&amp;text=Alguma%20d%C3%BAvida%3F&amp;textcolor=ffffff&amp;bgcolor=4E1D3A&amp;from=bottomRigth"></script>

This indicates that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.
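The inspection described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the names SrcCollector and find_sources_with_keyword and the shortened sample markup are assumptions made for this example, not part of the site or of Selenium. A match only shows that the keyword appears in a src URL; it does not by itself identify the vendor.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect the src attributes of <iframe> and <script> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []  # list of (tag, src) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "script"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append((tag, src))


def find_sources_with_keyword(html, keyword="dist"):
    """Return the (tag, src) pairs whose src URL contains the keyword."""
    collector = SrcCollector()
    collector.feed(html)
    return [(tag, src) for tag, src in collector.sources if keyword in src]


# Hypothetical sample markup, shortened from the tags quoted above.
sample_html = """
<iframe src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html"></iframe>
<script id="dtbot-script" src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js"></script>
<script src="https://example.com/app.js"></script>
"""

matches = find_sources_with_keyword(sample_html)
for tag, src in matches:
    print(tag, src)
```

With a live session, the same function could be given driver.page_source instead of the sample string, assuming the browser received any markup at all.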


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium, the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".



undetected Selenium