I have read a lot of posts on the topic and tried some of the advice from this article, but I am still blocked:
https://www.scraperapi.com/blog/5-tips-for-web-scraping
1. IP rotation: done. I'm using a VPN and change IP often (but not during the script, obviously).
2. Set a real User-Agent: implemented fake-useragent, with no luck.
3. Set other request headers: tried with SeleniumWire, but how do I use it at the same time as 2.?
4. Set random intervals between your requests: done, but right now I cannot even access the starting home page!
5. Set a referer: same as 3.
6. Use a headless browser: no clue.
7. Avoid honeypot traps: same as 4.
8. to 10.: irrelevant.
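To make points 2 to 5 concrete, here is a minimal sketch of what I understand "custom User-Agent plus other headers plus random intervals" to mean. It assumes Selenium Wire's `request_interceptor` hook works as its README describes; `random_pause`, `make_interceptor` and `build_driver` are hypothetical helper names of my own, and the header values are placeholders. I have no evidence that this is enough to avoid the block.

```python
import random
import time


def random_pause(min_seconds=1.0, max_seconds=4.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def make_interceptor(extra_headers):
    """Build a request interceptor that overrides the given headers."""
    def interceptor(request):
        for name, value in extra_headers.items():
            # Delete first: assigning to an existing header would add a duplicate.
            del request.headers[name]
            request.headers[name] = value
    return interceptor


def build_driver(user_agent, referer):
    """Create a Chrome driver whose requests carry the given headers (assumption)."""
    # Imported here so the helpers above can be used without Selenium Wire installed.
    from seleniumwire import webdriver

    driver = webdriver.Chrome()
    driver.request_interceptor = make_interceptor(
        {"User-Agent": user_agent, "Referer": referer}
    )
    return driver
```

The `user_agent` argument could be the string produced by fake-useragent, so points 2 and 3 would go through the same interceptor instead of two separate mechanisms.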
The website I want to scrape: https://www.winamax.fr/paris-sportifs/
Without Selenium: the site loads normally and shows a page with some games and their odds, and I can navigate from there.
With Selenium: the page shows a "Winamax est actuellement en maintenance" ("Winamax is currently under maintenance") message, with no games and no odds.
Try running this piece of code and you might get blocked quite quickly:
from selenium import webdriver
import time
from time import sleep
import json

driver = webdriver.Chrome(executable_path="chromedriver")
driver.get("https://www.winamax.fr/paris-sportifs/")  # I'm even blocked here now!
toto = driver.page_source.splitlines()
titi = {}
matchez = []
matchez_detail = []
resultat_1 = {}
resultat_2 = {}
taratata = 1
comptine = 1

for tut in toto:
    if tut[0:53] == "<script type=\"text/javascript\">var PRELOADED_STATE = ": titi = json.loads(tut[53:tut.find(";var BETTING_CONFIGURATION = ")])

for p_id in titi.items():
    if p_id[0] == "sports":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_1[tyty[0]] = tyty[1]["categories"]

for p_id in titi.items():
    if p_id[0] == "categories":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_2[tyty[0]] = tyty[1]["tournaments"]

for p_id in resultat_1.items():
    for tgtg in p_id[1]:
        for p_id2 in resultat_2.items():
            if str(tgtg) == p_id2[0]:
                for p_id3 in p_id2[1]:
                    matchez.append("https://www.winamax.fr/paris-sportifs/sports/"+str(p_id[0])+"/"+str(tgtg)+"/"+str(p_id3))

for alisson in matchez:
    print("compet " + str(taratata) + "/" + str(len(matchez)) + " : " + alisson)
    taratata = taratata + 1
    driver.get(alisson)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']/div/div[1]/span/div/div[2]/div/section/div/div/div[1]/div/div/div/div/a")
    for elm in elements:
        matchez_detail.append(elm.get_attribute("href"))

for mat in matchez_detail:
    print("match " + str(comptine) + "/" + str(len(matchez_detail)) + " : " + mat)
    comptine = comptine + 1
    driver.get(mat)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']//button/div/span")
    for elm in elements:
        elm.click()
        sleep(1)  # and after my specific code to scrape what I want
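
For readability, the part of the script that parses the page and builds the tournament URLs can be sketched without Selenium at all. This is only a sketch under the same assumptions as my code: the page source contains `var PRELOADED_STATE = ` followed by JSON, then `;var BETTING_CONFIGURATION = `, and the JSON has `sports` and `categories` dictionaries shaped as my loops expect. The function names `extract_preloaded_state` and `tournament_urls` are hypothetical.

```python
import json

BASE_URL = "https://www.winamax.fr/paris-sportifs/sports/"
START_MARKER = "var PRELOADED_STATE = "
END_MARKER = ";var BETTING_CONFIGURATION = "


def extract_preloaded_state(page_source):
    """Return the JSON embedded between the two markers, or {} if absent."""
    start = page_source.find(START_MARKER)
    if start == -1:
        return {}
    start += len(START_MARKER)
    end = page_source.find(END_MARKER, start)
    if end == -1:
        return {}
    return json.loads(page_source[start:end])


def tournament_urls(state):
    """Build one URL per sport/category/tournament combination."""
    sports = state.get("sports", {})
    categories = state.get("categories", {})
    urls = []
    for sport_id, sport in sports.items():
        for category_id in sport.get("categories", []):
            # Category keys are strings in the JSON, so normalise the ID.
            category = categories.get(str(category_id))
            if category is None:
                continue
            for tournament_id in category.get("tournaments", []):
                urls.append(f"{BASE_URL}{sport_id}/{category_id}/{tournament_id}")
    return urls
```

With this, `tournament_urls(extract_preloaded_state(driver.page_source))` would produce the same list as `matchez`, and an empty list when the maintenance page is served instead of the real one.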