
I've read a lot of posts on the topic, and also tried some of this article's advice, but I am still blocked.

https://www.scraperapi.com/blog/5-tips-for-web-scraping

  1. IP Rotation: done. I'm using a VPN and changing IP often (but not DURING the script, obviously).

  2. Set a Real User-Agent: implemented with fake-useragent, with no luck.

  3. Set other request headers: tried with selenium-wire, but how do I use it at the same time as 2.?

  4. Set random intervals between your requests: done, but right now I can't even access the starting home page!

  5. Set a referer: same as 3.

  6. Use a headless browser: no clue.

  7. Avoid honeypot traps: same as 4.

  8. to 10.: irrelevant.

The website I want to scrape: https://www.winamax.fr/paris-sportifs/

Without Selenium: I get straight to a page with some games and their odds, and I can navigate from there.

With Selenium: the page shows a "Winamax est actuellement en maintenance" ("Winamax is currently under maintenance") message, with no games and no odds.

Try executing this piece of code and you might get blocked quite quickly:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import json

driver = webdriver.Chrome()  # Selenium 4 resolves chromedriver automatically
driver.get("https://www.winamax.fr/paris-sportifs/")  # I'm even blocked here now!

# The odds data is embedded in an inline <script> as a PRELOADED_STATE JSON blob.
marker = '<script type="text/javascript">var PRELOADED_STATE = '
state = {}
for line in driver.page_source.splitlines():
    if line.startswith(marker):
        state = json.loads(line[len(marker):line.find(";var BETTING_CONFIGURATION = ")])

# Map each sport to its category ids, and each category to its tournament ids.
sport_categories = {sport_id: sport["categories"]
                    for sport_id, sport in state.get("sports", {}).items()}
category_tournaments = {cat_id: cat["tournaments"]
                        for cat_id, cat in state.get("categories", {}).items()}

# Build the tournament URLs: /sports/<sport>/<category>/<tournament>
tournament_urls = []
for sport_id, categories in sport_categories.items():
    for cat_id in categories:
        for tournament_id in category_tournaments.get(str(cat_id), []):
            tournament_urls.append(
                "https://www.winamax.fr/paris-sportifs/sports/"
                f"{sport_id}/{cat_id}/{tournament_id}")

# Collect the individual match links from every tournament page.
match_urls = []
for i, url in enumerate(tournament_urls, start=1):
    print(f"compet {i}/{len(tournament_urls)} : {url}")
    driver.get(url)
    sleep(1)
    links = driver.find_elements(
        By.XPATH,
        "//*[@id='app-inner']/div/div[1]/span/div/div[2]/div/section"
        "/div/div/div[1]/div/div/div/div/a")
    match_urls.extend(link.get_attribute("href") for link in links)

# Visit each match and expand the odds buttons before scraping.
for i, url in enumerate(match_urls, start=1):
    print(f"match {i}/{len(match_urls)} : {url}")
    driver.get(url)
    sleep(1)
    for button in driver.find_elements(By.XPATH, "//*[@id='app-inner']//button/div/span"):
        button.click()
        sleep(1)  # and after, my specific code to scrape what I want
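As an aside, matching the `<script>` prefix line by line breaks if the page is ever served minified onto a single line; a regex over the whole page source is more robust. A minimal, stubbed sketch (the helper name `extract_preloaded_state` is mine):

```python
import json
import re


def extract_preloaded_state(page_source: str) -> dict:
    """Pull the PRELOADED_STATE JSON blob out of raw page source."""
    match = re.search(
        r"var PRELOADED_STATE = (\{.*?\});var BETTING_CONFIGURATION",
        page_source,
        re.DOTALL,  # the blob may span multiple lines
    )
    return json.loads(match.group(1)) if match else {}


# Demo with a stubbed page (the real page embeds a much larger object):
html = ('<script type="text/javascript">var PRELOADED_STATE = '
        '{"sports": {"1": {"categories": [30]}}};var BETTING_CONFIGURATION = {};</script>')
print(extract_preloaded_state(html)["sports"]["1"]["categories"])  # [30]
```

The non-greedy `.*?` stops at the first `};var BETTING_CONFIGURATION`, mirroring the `find(";var BETTING_CONFIGURATION = ")` cut in the script above.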
jeremoquai

1 Answer


I recommend using requests; I don't see a reason to use Selenium, since you said requests works. requests can handle pretty much any site as long as you use appropriate headers. You can see which headers are needed by opening the developer console in Chrome or Firefox and inspecting the request headers on the Network tab.
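For example, a session can be set up with browser-like headers copied from DevTools; the values below are illustrative placeholders, not the exact headers Winamax expects, so replace them with the ones your own browser actually sends:

```python
import requests

# Browser-like headers; copy the real values from your own DevTools
# Network tab rather than trusting these illustrative ones.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr-FR,fr;q=0.9,en;q=0.8",
    "Referer": "https://www.winamax.fr/",
}

session = requests.Session()
session.headers.update(HEADERS)  # every request on this session now sends them
# response = session.get("https://www.winamax.fr/paris-sportifs/")
```

Using a `Session` also keeps cookies across requests, which some sites check in addition to headers.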

Noah