I am using Selenium WebDriver to try to scrape information from realestate.com.au. Here is my code:

from selenium.webdriver import Chrome
from bs4 import BeautifulSoup

path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'  # raw string avoids backslash-escape issues
url = 'https://www.realestate.com.au/buy'
url2 = 'https://www.realestate.com.au/property-house-nsw-castle+hill-134181706'
webdriver = Chrome(path)
webdriver.get(url)
soup = BeautifulSoup(webdriver.page_source, 'html.parser')
print(soup)

It works fine with url, but when I try to do the same with url2, it opens a blank page, and when I check the console I get the following:

"Failed to load resource: the server responded with a status of 429 ()
about:blank:1 Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME
149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint:1 Failed to load resource: the server responded with a status of 404 ()"

Also, after opening url, searching for anything leads to a blank page, just like url2.

Swaroop Humane
MYX1994
2 Answers


It looks like the www.realestate.com.au website is using an Akamai security tool.

A quick DNS lookup shows that www.realestate.com.au resolves to dualstack.realestate.com.au.edgekey.net.

They are most likely using the Bot Manager product (https://www.akamai.com/us/en/products/security/bot-manager.jsp). I have encountered this on another website recently.

Typically, rotating user agents and IP addresses (ideally using residential proxies) does the trick: you want to load the site with a "fresh" browser profile each time. You could also check out https://github.com/67-6f-64/akamai-sensor-data-bypass
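A minimal sketch of the rotation idea, assuming Selenium's Chrome options API; the user-agent strings and the proxy format here are placeholder examples, not values from the answer:

```python
import random

# Placeholder pool of user agents -- substitute realistic, current strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
]

def pick_user_agent(pool=USER_AGENTS):
    """Pick a random user agent for each fresh session."""
    return random.choice(pool)

def make_fresh_driver(proxy=None):
    # Imported inside the function so pick_user_agent() stays usable
    # even where selenium is not installed.
    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument(f"--user-agent={pick_user_agent()}")
    opts.add_argument("--incognito")   # start with a clean browser profile
    if proxy:                          # e.g. "http://user:pass@host:port"
        opts.add_argument(f"--proxy-server={proxy}")
    return Chrome(options=opts)
```

Each call to make_fresh_driver() then launches a new browser with a different user agent (and optionally a different proxy), which is the "fresh profile each time" approach the paragraph above describes.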

Jeff Rainer
0

I think you should try adding driver.implicitly_wait(10) before your get line; this adds an implicit wait, in case the page loads too slowly for the driver to pull the site. You should also consider trying the Firefox WebDriver, since this bug appears to affect only Chromium-based browsers.
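The suggested wait could be wrapped like this (a sketch; note that implicitly_wait sets a polling timeout for element lookups rather than for the initial page load, so it may not help if the server is blocking the request outright):

```python
def get_with_wait(driver, url, timeout=10):
    # Implicit wait: subsequent find_element calls on this driver will
    # poll for up to `timeout` seconds before raising.
    driver.implicitly_wait(timeout)
    driver.get(url)
```

With the code from the question, this would be called as get_with_wait(webdriver, url2).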

NeelD
  • Hi, I added implicitly_wait(10) and tried with Firefox; it still has the same issue. I think it is something to do with the web server blocking Selenium. Is there any way to get past it? – MYX1994 Aug 13 '20 at 01:02
  • Ah ok, my bad. You should check out this SO post, which seems to be a very similar issue; the site must be employing some anti-scraping measures. Here's the link to the solution: [SO](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver) – NeelD Aug 13 '20 at 01:17
  • Also, are you using up-to-date versions of these webdrivers? In most of the similar issues I've seen online, the problem had been patched out. – NeelD Aug 13 '20 at 01:28
  • The 429 error means "Too Many Requests". I believe the server is detecting that you are using Selenium with the help of JavaScript running on the page, and because of that you are receiving a blank page with 429 status. Following your post for answers. – Swaroop Humane Aug 13 '20 at 02:49