I've been using Selenium on Google Colab to download seller data from an auction site. For several weeks I have been unable to download the page content. I added a random user agent with fake-useragent, but the result is the same. How else can I make the browser look like a real user so that the page downloads?

My code:

from selenium import webdriver
from fake_useragent import UserAgent

options = webdriver.ChromeOptions()

# Pick a random user agent string and print it for reference
ua = UserAgent(use_cache_server=False)
userAgent = ua.random
print(userAgent)

options.add_argument('--window-size=1280,800')
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Ask Blink not to expose the automation flag (navigator.webdriver)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument(f'--user-agent={userAgent}')

driver = webdriver.Chrome(options=options)
driver.get("https://allegro.pl/oferta/zageszczarka-6-5km-90kg-higher-briggs-gratisy-9003885105#aboutSeller")
print(driver.page_source)

Result:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<html><head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style><meta name="viewport" content="width=device-width, initial-scale=1.0"></head><body style="margin:0"><script>var dd={'cid':'AHrlqAAAAAMAOIflZgDZm2IAI-ywFA==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29560,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script><script>if("string"==typeof navigator.userAgent&&navigator.userAgent.indexOf("Firefox")>-1){var isIframeLoaded=!1,maxTimeoutMs=5e3;function iframeOnload(e){isIframeLoaded=!0;var a=document.getElementById("noiframe");a&&a.parentNode.removeChild(a)}var initialTime=(new Date).getTime();setTimeout(function(){isIframeLoaded||(new Date).getTime()-initialTime>maxTimeoutMs&&(document.body.innerHTML='<div id="noiframe">Please enable JS and disable any ad blocker</div>'+document.body.innerHTML)},maxTimeoutMs)}else function iframeOnload(){}</script><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMAOIflZgDZm2IAI-ywFA%3D%3D&amp;hash=77DC0FFBAA0B77570F6B414F8E5BDB&amp;cid=ak0Wk_5LBEPLw9rTmErZ~211JLk9IruT-DV3pn2r.NzAZ_JOOcDsOjFjoiO8O88Uty8imz7f4IXqYdOqun_vy9SJOl7y7x-cu4m.D1jxOt&amp;t=fe&amp;referer=https%3A%2F%2Fallegro.pl%2Foferta%2Fzageszczarka-6-5km-90kg-higher-briggs-gratisy-9003885105%23aboutSeller&amp;s=29560" width="100%" height="100%" style="height:100vh;" frameborder="0" border="0" scrolling="yes" onload="iframeOnload()"></iframe>
</body></html>
dominik
  • `fake_useragent` and other user-agent-generating modules give very old user agents. Can you try hardcoding your own browser's user agent and see whether you are able to scrape? If yes, then collect more recent user agents and pick from them randomly rather than using a module for it. – Shreyesh Desai Apr 11 '21 at 11:19
  • There are some headers that Selenium/webdriver inserts that make it appear as a robot. You can search for "How to avoid reCaptcha in Selenium Python", and you'll find the headers you need to add. – thethiny Apr 11 '21 at 11:25
  • @ShreyeshDesai I used the user agent from my browser: `Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36` with the same result – dominik Apr 11 '21 at 11:39

1 Answer

I checked the site, and it seems it can blacklist an IP address if you use a Selenium-driven Chrome browser.

undetected-chromedriver should work in headed mode (headless mode is not guaranteed): https://github.com/ultrafunkamsterdam/undetected-chromedriver

Also, the server running your Google Colab session must not already have a blacklisted IP. If it does, there is not much you can do about it.
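A minimal sketch of how this could look, assuming the `undetected-chromedriver` package is installed and a display is available for headed mode. The helper names `looks_blocked` and `fetch_page` are made up for illustration, and the marker string is simply the captcha host that appears in the blocked response from the question, so it may not cover every kind of block page.

```python
# Marker taken from the blocked response in the question: the captcha
# script and iframe are served from *.captcha-delivery.com.
CAPTCHA_MARKER = "captcha-delivery.com"


def looks_blocked(page_source: str) -> bool:
    """Return True if the HTML looks like the captcha page, not the offer."""
    return CAPTCHA_MARKER in page_source


def fetch_page(url: str) -> str:
    """Load url in a headed undetected-chromedriver session and return the HTML."""
    # Imported here so looks_blocked() can be used without the package.
    import undetected_chromedriver as uc  # pip install undetected-chromedriver

    options = uc.ChromeOptions()
    options.add_argument("--window-size=1280,800")
    # Deliberately no --headless flag: headless mode is not guaranteed to work.
    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling `looks_blocked(fetch_page(url))` with the offer URL from the question tells you whether you still got the captcha page. If it returns True even with this driver, the IP is probably the problem rather than the browser fingerprint.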


Edit: you can read more about how sites detect the Selenium driver here: https://stackoverflow.com/a/56529616/8068153

jackblk