
Code Proposal:

The goal is to collect the links to all of the day's games listed on the page (https://int.soccerway.com/matches/2021/07/28/), with the freedom to change the date to any other, such as 2021/08/01 and so on, so that in the future I can loop over several different days and collect all their match links in a single run.

Although it is very slow, this approach (without headless mode) clicks all the expander buttons, reveals the hidden matches and collects all 465 listed match links:

for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()

Full Code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)

url = "https://int.soccerway.com/matches/2021/07/28/"

driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()

But when I add options.add_argument("headless") so that the browser window does not open on my screen, the script fails with the following error:

Message: element click intercepted

To work around this problem, I looked at the available options and found this WebDriverWait approach (https://stackoverflow.com/a/62904494/11462274), which I tried to use like this:

for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()

Full Code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

from selenium.webdriver.support.ui import WebDriverWait       
from selenium.webdriver.common.by import By       
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("headless")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)

url = "https://int.soccerway.com/matches/2021/07/28/"

driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()

But because the returned value is not iterable, it fails with:

'NoneType' object is not iterable

Why do I need this option?

1 - I'm going to automate this in an online terminal, where there is no screen for a browser to open on, and it needs to be fast so that I don't burn through my time limits on the terminal.

2 - I need a way to use any date instead of 2021/07/28 in:

url = "https://int.soccerway.com/matches/2021/07/28/"

where in the future I'll substitute the parameter:

today = date.today().strftime("%Y/%m/%d")
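To illustrate point 2, here is a minimal sketch of building that URL for any date or range of dates; the `daily_urls` helper is my own invention for illustration, not part of the original code:

```python
from datetime import date, timedelta

def daily_urls(start, days):
    # Build one matches-page URL per day, starting at `start`.
    base = "https://int.soccerway.com/matches/{}/"
    return [base.format((start + timedelta(days=i)).strftime("%Y/%m/%d"))
            for i in range(days)]

for url in daily_urls(date(2021, 7, 28), 3):
    print(url)
# https://int.soccerway.com/matches/2021/07/28/
# https://int.soccerway.com/matches/2021/07/29/
# https://int.soccerway.com/matches/2021/07/30/
```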

In this answer (https://stackoverflow.com/a/68535595/11462274), someone suggested a very fast and interesting option (named "Quicker Version" at the end of the answer) that doesn't need a WebDriver at all, but I was only able to make it work for the site's first page: when I try other dates of the year, it keeps returning only the links for the current day's games.

Expected result (there are 465 links; I haven't pasted them all because of the character limit):

https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-sheriff-tiraspol/alashkert-fc/3517568/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-neftchi/olympiakos-cfp/3517569/        
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/scs-cfr-1907-cluj-sa/newcastle-fc/3517571/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-midtjylland/celtic-fc/3517576/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-razgrad-2000/mura/3517574/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/galatasaray-sk/psv-nv/3517577/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/bsc-young-boys-bern/k-slovan-bratislava/3517566/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-crvena-zvezda-beograd/fc-kairat-almaty/3517570/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/ac-sparta-praha/sk-rapid-wien/3517575/
https://int.soccerway.com/matches/2021/07/28/world/olympics/saudi-arabia-u23/brazil--under-23/3497390/
https://int.soccerway.com/matches/2021/07/28/world/olympics/germany-u23/cote-divoire-u23/3497391/
https://int.soccerway.com/matches/2021/07/28/world/olympics/romania-u23/new-zealand-under-23/3497361/
https://int.soccerway.com/matches/2021/07/28/world/olympics/korea-republic-u23/honduras-u23/3497362/
https://int.soccerway.com/matches/2021/07/28/world/olympics/australia-under-23/egypt-under-23/3497383/
https://int.soccerway.com/matches/2021/07/28/world/olympics/spain-under-23/argentina-under-23/3497384/
https://int.soccerway.com/matches/2021/07/28/world/olympics/france-u23/japan-u23/3497331/
https://int.soccerway.com/matches/2021/07/28/world/olympics/south-africa-u23/mexico-u23/3497332/
https://int.soccerway.com/matches/2021/07/28/africa/cecafa-senior-challenge-cup/uganda-under-23/eritrea-under-23/3567664/

Note 1: There are multiple variants of the score-time class, such as score-time status and score-time score, which is why I used contains() in "//td[contains(@class,'score-time')]//a".
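As a quick illustration of why contains() matters here, consider this made-up miniature of the page's markup (the fragment below is invented for the example, not the real HTML):

```python
from lxml import html

# Two cells whose class attribute starts with "score-time" but differs after it,
# mimicking the "score-time status" / "score-time score" variants on the site.
fragment = html.fromstring("""
<table><tr>
  <td class="score-time status"><a href="/match/1/">19:30</a></td>
  <td class="score-time score"><a href="/match/2/">2 - 1</a></td>
</tr></table>
""")

# An exact match like //td[@class='score-time status'] would hit only one
# variant; contains() catches both.
links = fragment.xpath("//td[contains(@class,'score-time')]//a/@href")
print(links)  # ['/match/1/', '/match/2/']
```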

Update

If possible, in addition to helping me solve the current problem, I am interested in an improved and faster option for the method I currently use. (I'm still learning, so my methods are pretty archaic).

halfer
Digital Farmer

3 Answers


Brondby IF,

I see two issues with your script.

The first is:

for btn in WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()

Basically, this is wrong because element_to_be_clickable returns a single WebElement, so you get the not-iterable error. Instead, we can use visibility_of_all_elements_located, which returns a list.

Second, you cannot click directly, because a few elements are not in the Selenium viewport, so we have to use ActionChains.

See below :

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

options = webdriver.ChromeOptions()
options.add_argument("--disable-infobars")
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
options.add_experimental_option("prefs", {"profile.default_content_setting_values.notifications": 2})
options.add_argument("--headless")
options.add_experimental_option("prefs", {"profile.default_content_settings.cookies": 2})

driver = webdriver.Chrome(options = options)
driver.implicitly_wait(30)
driver.get("https://int.soccerway.com/")
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
sleep(10)
for btn in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    ActionChains(driver).move_to_element(btn).click().perform()
sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)

Output :

https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-sheriff-tiraspol/alashkert-fc/3517568/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-neftchi/olympiakos-cfp/3517569/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/scs-cfr-1907-cluj-sa/newcastle-fc/3517571/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-midtjylland/celtic-fc/3517576/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-razgrad-2000/mura/3517574/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/galatasaray-sk/psv-nv/3517577/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/bsc-young-boys-bern/k-slovan-bratislava/3517566/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-crvena-zvezda-beograd/fc-kairat-almaty/3517570/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/ac-sparta-praha/sk-rapid-wien/3517575/
https://int.soccerway.com/matches/2021/07/28/world/olympics/saudi-arabia-u23/brazil--under-23/3497390/
https://int.soccerway.com/matches/2021/07/28/world/olympics/germany-u23/cote-divoire-u23/3497391/
https://int.soccerway.com/matches/2021/07/28/world/olympics/romania-u23/new-zealand-under-23/3497361/
https://int.soccerway.com/matches/2021/07/28/world/olympics/korea-republic-u23/honduras-u23/3497362/
https://int.soccerway.com/matches/2021/07/28/world/olympics/australia-under-23/egypt-under-23/3497383/
https://int.soccerway.com/matches/2021/07/28/world/olympics/spain-under-23/argentina-under-23/3497384/
https://int.soccerway.com/matches/2021/07/28/world/olympics/france-u23/japan-u23/3497331/
https://int.soccerway.com/matches/2021/07/28/world/olympics/south-africa-u23/mexico-u23/3497332/
cruisepandey
  • Hi @cruisepandey , thanks for help, but in this case the buttons are not being click, correct? Because when the buttons are clicked, leagues that have hidden games open, creating a much larger list of collected links. Is the only way to open these buttons using the active screen to browser? – Digital Farmer Jul 28 '21 at 05:53
  • @BrondbyIF : as you have mentioned the same code works with browser mode, I believe the output should be same with headless mode as well. – cruisepandey Jul 28 '21 at 06:21
  • Hi @cruisepandey This is the problem, when I use ```for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head clickable')]"):```, all buttons are clicked, but only works when ```Headless``` is disabled, with it enabled the buttons are not clicked. – Digital Farmer Jul 28 '21 at 12:49
  • @BrondbyIF : Since now we have bounty in place, could you tell me the steps that you want to automate ? each and every steps I need to figure this out – cruisepandey Jul 30 '21 at 11:25
  • Hello my friend @cruisepandey good afternoon. I added the proposed use of the code at the beginning of the question, but it's very simple, collect all the game links present on that site's page that has the date at the end of the url like ```https://int.soccerway.com/matches/2021/07/31/``` so I can change the date whenever I need it. – Digital Farmer Jul 30 '21 at 13:34

You don't need Selenium

Selenium should never be the primary way of scraping data from the web. It's slow and generally requires more lines of code than its alternatives. Whenever possible, use requests coupled with the lxml parser. In this particular use case, you're using selenium only to switch between different URLs, which is something that can be easily hardcoded, thereby avoiding the need to use it in the first place.

import requests
from lxml import html
import csv
import re
from datetime import datetime
import json

class GameCrawler(object):
    def __init__(self):
        self.input_date = input('Specify a date e.g. 2021/07/28: ')
        self.date_object = datetime.strptime(self.input_date, "%Y/%m/%d")
        self.output_file = '{}.csv'.format(re.sub('/', '-', self.input_date))
        self.ROOT_URL = 'https://int.soccerway.com'
        self.json_request_url = '{}/a/block_competition_matches_summary'.format(self.ROOT_URL)
        self.entry_point = '{}/matches/{}'.format(self.ROOT_URL, self.input_date)
        self.session = requests.Session()
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        self.all_game_urls = []
        self.league_urls = self.get_league_urls()

    def save_to_csv(self):
        with open(self.output_file, 'a+') as f:
            writer = csv.writer(f)
            for row in self.all_game_urls:
                writer.writerow([row]) 
        return

    def request_other_pages(self, page_params):
        params = {
            'block_id': 'page_competition_1_block_competition_matches_summary_11',
            'callback_params': json.dumps({
                "page": page_params['page_count'] + 2, 
                "block_service_id": "competition_summary_block_competitionmatchessummary",
                "round_id": int(page_params['round_id']),
                "outgroup":"",
                "view":1,
                "competition_id": int(page_params['competition_id'])
            }),
            'action': 'changePage',
            'params': json.dumps({"page": page_params['page_count']}),
        }
        response = self.session.get(self.json_request_url, headers=self.HEADERS, params=params)
        if response.status_code != 200:
            return
        else:
            json_data = json.loads(response.text)["commands"][0]["parameters"]["content"]
            return html.fromstring(json_data)

    def get_page_params(self, tree, response):
        res = re.search(r'r(\d+)?/$', response.url)
        if res:
            page_params = {
                'round_id': res.group(1),
                'competition_id': tree.xpath('//*[@data-competition]/@data-competition')[0],
                'page_count': len(tree.xpath('//*[@class="page-dropdown"]/option'))
            }
            return page_params if page_params['page_count'] != 0 else {}
        return {}

    def match_day_check(self, game):
        timestamp = game.xpath('./@data-timestamp')[0]
        match_date = datetime.fromtimestamp(int(timestamp))
        return self.date_object.day == match_date.day

    def scrape_page(self, tree):
        for game in tree.xpath('//*[@data-timestamp]'):
            game_url = game.xpath('./td[@class="score-time "]/a/@href')
            if game_url and self.match_day_check(game):
                self.all_game_urls.append('{}{}'.format(self.ROOT_URL, game_url[0]))
        return

    def get_league_urls(self):
        page = self.session.get(self.entry_point, headers=self.HEADERS)
        tree = html.fromstring(page.content)
        league_urls = ['{}{}'.format(self.ROOT_URL, league_url) for league_url in tree.xpath('//th[@class="competition-link"]/a/@href')]
        return league_urls

    def main(self):
        for index, league_url in enumerate(self.league_urls):
            response = self.session.get(league_url, headers=self.HEADERS)
            tree = html.fromstring(response.content)
            self.scrape_page(tree)
            page_params = self.get_page_params(tree, response)
            if page_params.get('page_count', 0) != 0:
                while True:
                    page_params['page_count'] = page_params['page_count'] - 1
                    if page_params['page_count'] == 0:
                        break
                    tree = self.request_other_pages(page_params)
                    if tree is None:
                        continue
                    self.scrape_page(tree)
            print('Retrieved links for {} out of {} competitions'.format(index+1, len(self.league_urls)))
        self.save_to_csv()
        return

if __name__ == '__main__':
    GameCrawler().main()

So when is Selenium worth using?

Nowadays, it's common for websites to serve dynamic content, so if the data that you'd want to retrieve isn't statically loaded:

  1. check the browser's network tab to see whether there's a request specific to the data of interest to you and,
  2. try to emulate it with requests.

If points #1 and #2 aren't possible due to the way in which the webpage is designed, your best option would then be to use selenium, which will fetch the required content through simulated user interactions. For the HTML parsing, you may still choose to use lxml, or you can stick to selenium, which also provides that functionality.
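The network-tab approach in points #1 and #2 can be sketched as follows. The endpoint and parameter names are the ones the GameCrawler class above already uses; the round_id and competition_id values are placeholders here, since the real values have to be scraped from each competition page:

```python
import json
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Query-string parameters copied from the pagination XHR seen in the browser's
# network tab; round_id and competition_id are placeholder values.
params = {
    'block_id': 'page_competition_1_block_competition_matches_summary_11',
    'callback_params': json.dumps({
        "page": 3,
        "block_service_id": "competition_summary_block_competitionmatchessummary",
        "round_id": 0,
        "outgroup": "",
        "view": 1,
        "competition_id": 0,
    }),
    'action': 'changePage',
    'params': json.dumps({"page": 1}),
}

# Uncomment to fire the actual request:
# r = session.get('https://int.soccerway.com/a/block_competition_matches_summary',
#                 params=params)
# content = json.loads(r.text)["commands"][0]["parameters"]["content"]
```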

First edit:

  • Fixed issues raised by OP
  • Included a limitation of the presented code
  • Code refactoring
  • Added a date check to make sure that only those matches which were played on the specified date are saved
  • Added functionality for allowing search results to be saved

Second edit:

  • Added functionality for navigating through all pages of each listed competition with get_page_params() and request_other_pages()
  • More code refactoring
micmalti
  • Hello buddy, how are you? I am doing tests and I believe that the model shared by you does not meet my need, as it is accessing the same site from several different countries, but only collecting the exposed links, it is not collecting the hidden links of the games inside on leagues that we need to click on the buttons to appear. My need is to collect all the links on the ```https://int.``` version page, which are over 400 links to collect. – Digital Farmer Aug 02 '21 at 00:41
  • And @micmalti I had to click buttons at first, because when the webdriver opens the browser, it comes clean, so the first time it accesses the site, you are directed to your country's version, but I need the international version, so I need to click on the menu and click on the ```https://int.```, the other part that also clicks on the buttons, are to expand the leagues that hide the games and to see them they need to be clicked. – Digital Farmer Aug 02 '21 at 00:49
  • 1
    @BrondbyIF thanks for pointing out the errors in my code. I've now updated my answer. – micmalti Aug 02 '21 at 17:54
  • Hi @micmalti, But from what I'm seeing, unfortunately this is not an option that would solve my problem, because in the case from what I understand you are getting the competitions link, accessing their page and collecting the games, right? Unfortunately the site has some flaws and within these competition pages, not always appear every game of the day, if it is a late round match for example, it does not appear. The option would actually be to collect on the main page. – Digital Farmer Aug 02 '21 at 18:58
  • For example, your project collected a considerably smaller number of links than the total. In my model, 448 links on day ```2021/07/28``` are currently collected, in your 174 links – Digital Farmer Aug 02 '21 at 18:59
  • The data existing within competitions pages is different from the daily listing on the games page. – Digital Farmer Aug 02 '21 at 19:00
  • 1
    @BrondbyIF, I'm aware of this, which is why I included it as a limitation in my edit. This can be resolved by doing what I've suggested in points 1 and 2. I'll try and do it later. – micmalti Aug 02 '21 at 19:22
  • Sorry I didn't understand the second point, my English is poor. But now I get it! Thanks again for taking the time to try and help me! – Digital Farmer Aug 02 '21 at 19:28
  • 1
    @BrondbyIF I've updated my answer once again to overcome the limitation that was present in the previous code iteration. My script managed to scrape 547 links for `2021/07/28`. – micmalti Aug 04 '21 at 02:03
  • Hi mate, I put date like: ```self.input_date = "2021/08/03"``` and I put the code to run and nothing happened, I have no knowledge about the ```input()``` that was placed, I tried to write the date in the terminal and nothing happens either, what is the correct way to proceed? Thanks in advance. – Digital Farmer Aug 04 '21 at 03:42
  • 1
    @BrondbyIF, the script works fine. You just needed to leave it for a few minutes until it was ready. I've now added a `print` statement to keep track of its progress. The data is automatically saved into a CSV file within the same directory as the script. – micmalti Aug 04 '21 at 11:36

Try adding the window-size option, without using WebDriverWait:

options.add_argument("window-size=1440,900")

Output:

(screenshot of the printed match links omitted)

YaDav MaNish
  • Hi @YaDav Manish , thanks for help, but in this case the buttons are not being click, correct? Because when the buttons are clicked, leagues that have hidden games open, creating a much larger list of collected links. Is the only way to open these buttons using the active screen to browser? – Digital Farmer Jul 28 '21 at 05:54
  • 1
    @BrondbyIF I need to look into it for this, but my main focus in this answer was the ElementClickInterceptedException – YaDav MaNish Jul 28 '21 at 05:56