
I am trying to scrape https://www.hyatt.com. It is not for anything illegal; I just want to make a simple script that finds hotels matching my search.

But the problem is I am unable to even load the webpage using any bot. It simply does not load.

Here are some approaches I have already tried:

1. Used Selenium
2. Used the Scrapy framework to get the data
3. Used the Python requests library

from selenium import webdriver

driver = webdriver.Chrome()

driver.get("https://www.hyatt.com")

driver.close()

I just want the page to load. I will take care of the rest.

Ch Usman
  • I think this "var _cf = _cf || []; _cf.push(['_setFsp', true]); _cf.push(['_setBm', true]); _cf.push(['_setAu', '/resources/2109bf5ef81843cd811083f8338393']);" is part of Akamai's bot detection. They don't want you scraping the site... cURL is probably detected, too. – pcalkins May 30 '19 at 20:08
  • It definitely detects that you are using a bot. The response code gives you 429: Too Many Requests, and it won't allow you to use a bot. – Nic Laforge May 30 '19 at 20:22
  • For more info and possible solution you can refer to https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver – Nic Laforge May 30 '19 at 21:30
  • Thanks for the suggestions I will try the solutions provided on the link – Ch Usman May 31 '19 at 17:33
  • If you can afford one, you might want to use a smart proxy to avoid bot detection. – Gallaecio Jun 03 '19 at 15:09
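Following up on Gallaecio's proxy suggestion above, here is a minimal sketch of routing requests traffic through a paid proxy. The proxy URL and credentials are placeholders, not a real endpoint, and a proxy plus a realistic User-Agent alone may still not defeat Akamai:

```python
import requests

def make_proxied_session(proxy_url: str) -> requests.Session:
    """Build a requests Session that sends all HTTP/HTTPS traffic
    through the given proxy (e.g. a paid "smart" proxy endpoint)."""
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    # A realistic browser User-Agent helps, though it alone won't beat
    # fingerprint-based bot detection.
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    )
    return session

# Placeholder credentials - substitute your provider's endpoint:
session = make_proxied_session("http://user:pass@proxy.example.com:8000")
# response = session.get("https://www.hyatt.com")  # would go via the proxy
```

The same proxy URL can also be passed to Selenium via `--proxy-server=` in ChromeOptions if a full browser is needed.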

1 Answer


I took your code, added a few tweaks, and ran the same test at my end:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.hyatt.com")
WebDriverWait(driver, 20).until(EC.title_contains("Hyatt"))
print(driver.title)
driver.quit()

Eventually I ran into the same issue. Using Selenium, I was also unable to even load the webpage. But when I inspected the browser console, the errors clearly showed:

Failed to load resource: the server responded with a status of 404 () https://www.hyatt.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint

Snapshot: (screenshot of the 404 on the fingerprint request)


404 Not Found

The HTTP 404 Not Found client error response code indicates that the server can't find the requested resource. Links which lead to a 404 page are often called broken or dead links, and can be subject to link rot.

A 404 status code does not indicate whether the resource is temporarily or permanently missing. If a resource is permanently removed, a 410 (Gone) status should ideally be used instead of a 404.
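The status codes that come up in this thread can be summarized in a small helper. This is purely an illustrative sketch, not part of the original answer:

```python
def describe_status(code: int) -> str:
    """Map the HTTP status codes discussed in this thread to a short
    human-readable description."""
    meanings = {
        404: "Not Found - the server can't find the requested resource",
        410: "Gone - the resource was permanently removed",
        429: "Too Many Requests - rate limited, common when a bot is blocked",
    }
    return meanings.get(code, "status not covered here")

print(describe_status(429))
```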


Moving ahead, while inspecting the HTML DOM of https://www.hyatt.com/, I observed that some of the <script> and <noscript> tags refer to akam:

  • <script type="text/javascript" src="https://www.hyatt.com/akam/10/28f56097" defer=""></script>
  • <noscript><img src="https://www.hyatt.com/akam/10/pixel_28f56097?a=dD02NDllZTZmNzg1NmNmYmIyYjVmOGFiOGYwMWI5YWMwZmM4MzcyZGY5JmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>

This is a clear indication that the website is protected by the bot management service provider Akamai Bot Manager, and that navigation by a WebDriver-driven browser client gets detected and subsequently blocked.
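As a rough illustration (not part of the original answer), a fetched page can be scanned for those Akamai markers with simple string checks. The marker list is assumed from the `<script>`/`<noscript>` snippets and the `_cf.push` fragment quoted above:

```python
# Markers taken from the snippets quoted in this thread.
AKAMAI_MARKERS = ("/akam/", "_setBm", "pixel_")

def looks_akamai_protected(html: str) -> bool:
    """Heuristic: True if the page source contains hints of
    Akamai Bot Manager instrumentation."""
    return any(marker in html for marker in AKAMAI_MARKERS)

sample = '<script src="https://www.hyatt.com/akam/10/28f56097" defer=""></script>'
print(looks_akamai_protected(sample))  # True
```

A check like this only tells you why a page refuses to load in an automated browser; it does not help bypass the protection.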


Outro

You can find some more relevant discussions in:

undetected Selenium
  • The error 404 is probably a result of the error 429. The console only provides what the developer decides to print. The network tab provides the full trace. See that the request to the site returns 429. As mentioned in the comments on the question, there is already a thread for this error https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver – Nic Laforge May 30 '19 at 22:35
  • @NicLaforge Surprisingly, the _Network Tab_ has no entries registered :( – undetected Selenium May 30 '19 at 22:44
  • @DebanjanB It has entries, but you need to open the developer tools before calling ```driver.get()```. Ensure recording is on (it will be by default). Here's the network information I am getting: https://ibb.co/MnBQh8M – Nic Laforge May 30 '19 at 23:05
  • Hey, thanks for digging into this; it doesn't look like we can bypass the website's security. – Ch Usman May 31 '19 at 17:34
  • I will wait to see if someone has a solution to this. – Ch Usman Jun 01 '19 at 12:40
  • Hey, I was able to find a solution to the problem I was facing. It was not impossible to scrape the website; we just had to use some good-quality paid proxies, and boom, everything worked as we expected! – Ch Usman Jul 14 '19 at 09:30
  • @ChUsman Sounds great. However **good quality paid proxies** was never a part of your original question. Good luck. – undetected Selenium Jul 15 '19 at 08:56
  • Yeah, because at that time I didn't know that it could be the solution. – Ch Usman Jul 17 '19 at 09:53
  • @ChUsman Which proxies did you use for that site, please? – αԋɱҽԃ αмєяιcαη Aug 23 '22 at 03:22