1

Hey guys, I have been trying to scrape the commentary for every match from the cricinfo website. I can get the full data for the second innings, but not for the first innings: the innings drop-down does not seem to have any options or a select element when I inspect the source code. It would be great if someone could suggest a way to do this. This is the URL of the page: https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019

undetected Selenium
VYAS
  • What information do you need to get from the page? – Andrej Kesely Sep 20 '20 at 10:15
  • So basically, if you look, there is a filter for innings, like MI innings or CSK innings. I need the commentary data of both innings. I am able to get it for the CSK innings but not for the MI innings, as I am unable to change the filter – VYAS Sep 20 '20 at 11:00

2 Answers

0

The data is loaded dynamically via JavaScript, so it is not in the page source. You can use the requests and json modules to load it into Python straight from the API the page calls:

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019'
api_url = 'https://hsapi.espncricinfo.com/v1/pages/match/comments?lang=en&leagueId={leagueId}&eventId={eventId}&liveTest=false&filter=full&page={page}'

leagueId, eventId = re.findall(r'(\d+)/commentary/(\d+)', url)[0]

page = 1
while True:
    data = requests.get(api_url.format(page=page, leagueId=leagueId, eventId=eventId)).json()

    # uncomment next line to see all data:
    # print(json.dumps(data, indent=4))

    # print some data to screen:
    for comment in data['comments']:
        soup1 = BeautifulSoup(comment['preText'], 'html.parser')
        soup2 = BeautifulSoup(comment['text'], 'html.parser')
        soup3 = BeautifulSoup(comment['postText'], 'html.parser')

        print(soup1.get_text(strip=True, separator='\n'))
        print(soup2.get_text(strip=True, separator='\n'))
        print(soup3.get_text(strip=True, separator='\n'))

        print('-' * 80)

    page += 1

    if page > data['pagination']['pageCount']:
        break

Prints:

...

final ball. Can Mumbai cross 150? Pollard needs a six.
slower ball, full outside off, and
that's been smoked!
Drilled through the covers and
Chennai Super Kings 150 to win IPL 2019!
9.16pm
Another ravishing innings from Pollard against CSK in an IPL final. But will 150 be enough on this ground? Mumbai's innings was a stop-start one, with regular wickets ensuring they could never really accelerate. Deepak Chahar was excellent in his final three overs too, but Mumbai have two epic fast bowlers as well. Which team will win their fourth IPL title? We'll find out with Shashank Kishore when the second innings gets underway in a few minutes.
Shardul Thakur:
"Final game, best two teams in the IPL. We knew some hard cricket was going to happen. I feel Powerplay is where you can attack and take wicket. If you bowl defensively in the Powerplay, you will still get hit for fours and sixes. In the last game, I wanted to get early wickets but there was some good cricket played by Dhawan. But tonight, ball was swinging a bit. Rohit did hit me for a six, but idea wasn't to go away from my plan."
Raja: "@Vignesh That team did not have Dhoni as CAPTAIN"
Vignesh: "@Husen well , MI defended an even more low total in the same ground in 2017 finals against a team that had Dhoni ;)"
Satyam: "Think MI are 20-25 runs short here. At least 15 more would have been more defendable."
Divya: "Last 12 balls : 3 fours 3 wickets 6 dots 1 singles"
Mustafa Moudi: "If anyone feels this is a below-par score then let me remind everyone that MI defended 137 on this same ground and that too by a massive 40 runs and defeated the Home Team in this season !!"
Husen: "@Moustafa - That team did not have a Dhoni "


...
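
If you want to collect the comments instead of printing them, the same loop can be wrapped in a generator. This is only a minimal sketch: the names `parse_ids`, `fetch_json` and `iter_comments` are made up for illustration, and the endpoint and the `comments` / `pagination.pageCount` fields are simply the ones used in the code above, so they may change on the site's side. The fetcher is passed in as a parameter so the paging logic can be tried without touching the network.

```python
import re

# Endpoint copied from the answer above; it is not a documented API.
API_URL = (
    "https://hsapi.espncricinfo.com/v1/pages/match/comments"
    "?lang=en&leagueId={leagueId}&eventId={eventId}"
    "&liveTest=false&filter=full&page={page}"
)


def parse_ids(match_url):
    """Return (leagueId, eventId) taken from a commentary page URL."""
    found = re.search(r"(\d+)/commentary/(\d+)", match_url)
    if found is None:
        raise ValueError(f"no series/match ids found in {match_url!r}")
    return found.group(1), found.group(2)


def fetch_json(url):
    """Default fetcher (hypothetical helper); requests is imported lazily."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_comments(match_url, fetch=fetch_json):
    """Yield every comment dict, requesting pages until pageCount is reached."""
    league_id, event_id = parse_ids(match_url)
    page = 1
    while True:
        data = fetch(API_URL.format(leagueId=league_id, eventId=event_id, page=page))
        yield from data["comments"]
        if page >= data["pagination"]["pageCount"]:
            break
        page += 1
```

Usage would be something like `comments = list(iter_comments(url))`, after which each item can be cleaned with BeautifulSoup exactly as in the loop above.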
Andrej Kesely
  • Thanks a lot.. this works.. I have put some selenium code as well.. do you know how to do it there for both innings.. – VYAS Sep 20 '20 at 14:18
0

To scrape the first-innings commentary of a match with Selenium, you first have to change the innings filter. Induce WebDriverWait for element_to_be_clickable() to open the drop-down and select the innings, then for visibility_of_element_located() to read the text. You can use the following Locator Strategies:

  • Using XPATH:

    # -*- coding: utf-8 -*-
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    # Pass both switches in one call; a second "excludeSwitches" call would overwrite the first.
    options.add_experimental_option("excludeSwitches", ["enable-logging", "enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # Selenium 3 style; Selenium 4 takes the driver path through a Service object instead.
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='comment-container-head']/div/div/div/div"))).click()
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class, 'ci-dd__menu')]/div[contains(@class, 'ci-dd__menu-list')]/div[contains(@class, 'ci-dd__option') and text()='MI Innings']"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='match-comment-long-text match-comment-padder']/span"))).text)
    
  • Console Output:

    9.16pm Another ravishing innings from Pollard against CSK in an IPL final. But will 150 be enough on this ground? Mumbai's innings was a stop-start one, with regular wickets ensuring they could never really accelerate. Deepak Chahar was excellent in his final three overs too, but Mumbai have two epic fast bowlers as well. Which team will win their fourth IPL title? We'll find out with Shashank Kishore when the second innings gets underway in a few minutes.
    Shardul Thakur: "Final game, best two teams in the IPL. We knew some hard cricket was going to happen. I feel Powerplay is where you can attack and take wicket. If you bowl defensively in the Powerplay, you will still get hit for fours and sixes. In the last game, I wanted to get early wickets but there was some good cricket played by Dhawan. But tonight, ball was swinging a bit. Rohit did hit me for a six, but idea wasn't to go away from my plan."
    Raja: "@Vignesh That team did not have Dhoni as CAPTAIN"
    Vignesh: "@Husen well , MI defended an even more low total in the same ground in 2017 finals against a team that had Dhoni ;)"
    Satyam: "Think MI are 20-25 runs short here. At least 15 more would have been more defendable."
    Divya: "Last 12 balls : 3 fours 3 wickets 6 dots 1 singles"
    Mustafa Moudi: "If anyone feels this is a below-par score then let me remind everyone that MI defended 137 on this same ground and that too by a massive 40 runs and defeated the Home Team in this season !!"
    Husen: "@Moustafa - That team did not have a Dhoni "
    
  • Note: the code relies on the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium
  • Thanks for the info.. I get a timeout exception error.. should I do something else.. – VYAS Sep 20 '20 at 18:33
  • @VYAS At which line. Can you update the question with the entire error stack trace? – undetected Selenium Sep 20 '20 at 18:35
  • Sometimes I also get an element click intercepted error for some reason.. any idea what that is.. – VYAS Sep 20 '20 at 18:58
  • Remove the `implicitly_wait(15)`; you shouldn't mix implicit waits with **WebDriverWait**. Just copy and paste my code. – undetected Selenium Sep 20 '20 at 19:06
  • 1
    Thanks a lot for the effort.. I still get an element intercepted error.. I see the privacy policy pop-up opens in a new window, could that be the reason? – VYAS Sep 20 '20 at 19:18
  • @VYAS I just concentrated on your usecase of scraping text but didn't consider handling of the _privacy policy pop up_ assuming you were through it. – undetected Selenium Sep 20 '20 at 19:21
  • 1
    Appreciate the efforts.. I was also confused about whether that needs a bypass as well.. with the normal 2nd innings code it also loads but did not create any issues.. is there any specific line of code that needs to go in to bypass the privacy policy.. – VYAS Sep 20 '20 at 19:26
  • I'm still not sure about the _privacy policy_ you are speaking about as I didn't face it. – undetected Selenium Sep 20 '20 at 19:28
  • Brilliant one This works.. WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Continue']"))).click() – VYAS Sep 20 '20 at 19:31
  • 1
    Thanks so much for all your efforts.. really wonderful have a nice week ahead.. – VYAS Sep 20 '20 at 19:31
  • 1
    Actually I live in Europe and we have strict GDPR rules here.. so when the cricinfo website loads, the "I accept" privacy policy pop-up comes up.. so I had to find that element and click on it, and then it works perfectly.. just an extra line in the code to click that – VYAS Sep 20 '20 at 20:01
  • Hi, I tried the code and it seems like something has changed.. at least for the last line that selects the team innings.. any idea? – VYAS Feb 26 '21 at 11:23