
The website is giving me the same results for different URLs scraped. I suspect the reason is that Selenium is not letting the website load completely before producing the result. I first wrote the code with Beautiful Soup alone, but the SO community advised that Selenium was needed to render the final page before scraping. I now use Selenium to fetch the page and Beautiful Soup to parse it, but the same problem persists. The code is given below:

from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import datetime
import os
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
date_list = pd.date_range(start = "1971-02-01", end=datetime.date.today(), freq='1d')


chrome_options = Options()  
chrome_options.add_argument("--headless") # Opens the browser up in background
driver = webdriver.Chrome()

def get_batsmen(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting?at={date}'
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
        browser.implicitly_wait(10)
        
    
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_bowler(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling?at={date}'
    # page = requests.get(url).text
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_allrounder(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/all-rounder?at={date}'
    # page = requests.get(url).text
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

#Storing the data into multiple csvs

for date in date_list:
    year = date.year
    month = date.month
    day = date.day
    newpath = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}'
    if not os.path.exists(newpath):
        os.makedirs(newpath)
    newpath1 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}'
    if not os.path.exists(newpath1):
        os.makedirs(newpath1)
    newpath2 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}\{day}'
    if not os.path.exists(newpath2):
        os.makedirs(newpath2)
    get_batsmen(date).to_csv(newpath2+'/batsmen.csv')
    get_bowler(date).to_csv(newpath2+'/bowler.csv')
    get_allrounder(date).to_csv(newpath2+'/allrounder.csv')

I will be eternally grateful to anyone who can help.

– Divyam Bansal

2 Answers


Using an explicit wait may help; try the following, where locator identifies an element that only appears once the page has finished rendering:

WebDriverWait(browser, delay).until(EC.presence_of_element_located(locator))

Refer to this Answer
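
For instance, a minimal sketch of how this could slot into the question's code (the class name is taken from the question's own parsing code; the 10-second timeout and the helper name are assumptions, and chrome_options is the one defined in the question):

from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def get_rendered_html(url, timeout=10):
    # hypothetical helper; chrome_options comes from the question's setup
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        # block until at least one ranking cell is present, so page_source
        # reflects the fully rendered table rather than the initial shell
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "table-body__cell"))
        )
        return browser.page_source

Unlike an implicit wait, this blocks on a concrete condition, so the HTML is only captured once the ranking rows actually exist.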

– Reema

Use browser.implicitly_wait(10) before defining html:

from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import datetime
import os
import time 
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
date_list = pd.date_range(start = "1971-02-01", end=datetime.date.today(), freq='1d')


chrome_options = Options()  
chrome_options.add_argument("--headless") # Opens the browser up in background
driver = webdriver.Chrome()

def get_batsmen(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting?at={date}'
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        # time.sleep(15)  # fallback fixed sleep: uncomment if the implicit wait below is not enough
        browser.implicitly_wait(10)
        html = browser.page_source
        
    
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # item.find("a") returns None for cells without a link, and None.text
        # raises AttributeError, so skip those rows
        try:
            player_list.append(player_name.text)
        except AttributeError:
            continue
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_bowler(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling?at={date}'
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        browser.implicitly_wait(10)  # same wait fix as in get_batsmen
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        try:
            player_list.append(player_name.text)
        except AttributeError:
            continue
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_allrounder(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/all-rounder?at={date}'
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        browser.implicitly_wait(10)  # same wait fix as in get_batsmen
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        try:
            player_list.append(player_name.text)
        except AttributeError:
            continue
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

#Storing the data into multiple csvs

for date in date_list:
    year = date.year
    month = date.month
    day = date.day
    date = date.strftime("%Y-%m-%d")  # the site expects YYYY-MM-DD, not a full Timestamp like 2019-04-29 00:00:00
    newpath = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}'
    if not os.path.exists(newpath):
        os.makedirs(newpath)
    newpath1 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}'
    if not os.path.exists(newpath1):
        os.makedirs(newpath1)
    newpath2 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}\{day}'
    if not os.path.exists(newpath2):
        os.makedirs(newpath2)
    get_batsmen(date).to_csv(newpath2+'/batsmen.csv')
    get_bowler(date).to_csv(newpath2+'/bowler.csv')
    get_allrounder(date).to_csv(newpath2+'/allrounder.csv')
– 1hiv2m
  • If you are using Colab or Jupyter, restart the kernel and delete the old files made by the program. – 1hiv2m Feb 03 '22 at 12:52
  • This is still giving the same result. Even after applying the filter for the year 1971 using the url 'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting?at=1971-08-20', I am getting the current rankings. Could you please try running the code on your machine? – Divyam Bansal Feb 03 '22 at 12:57
  • Hey @DivyamBansal, it should work now; copy the code above, paste it, and run. The date format being passed was not accepted by the site: it was sending ```2019-04-29 00:00:00``` but the site wants ```2019-04-29```. I have made the changes and now it works. – 1hiv2m Feb 03 '22 at 13:44
  • Now it is giving the following error: ```Traceback (most recent call last): File "c:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\main4.py", line 94, in <module> get_batsmen(date).to_csv(newpath2+'/batsmen.csv') File "c:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\main4.py", line 38, in get_batsmen player_list.append(player_name.text) AttributeError: 'NoneType' object has no attribute 'text'``` – Divyam Bansal Feb 03 '22 at 14:06
  • The loop is running past the player names; there the value is ```None```, and ```None``` can't be converted to text. – 1hiv2m Feb 03 '22 at 14:22
  • Try doing it with [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html); see the sketch after this thread. – 1hiv2m Feb 03 '22 at 14:23
  • Can you edit your code with pd.read_html? I am a bit confused – Divyam Bansal Feb 03 '22 at 14:29
  • No worries, I have corrected your issue; you can run the code pasted above. Please let me know if it works. – 1hiv2m Feb 03 '22 at 14:38
  • The code is running but it is not giving the correct output throughout. In one iteration for a date, the batsmen and bowlers are fine but the all-rounders are the current ones. In the next iteration, the batsmen are the current ones and not related to the actual date. I cannot understand what's going wrong: how can one iteration give the correct output and the next the incorrect one? – Divyam Bansal Feb 03 '22 at 14:57
  • Close the driver with ```driver.close()``` at the end of the loop. Please mark my answer correct. – 1hiv2m Feb 03 '22 at 15:04
  • After a couple of iterations it is throwing ```selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id``` – Divyam Bansal Feb 03 '22 at 15:24
  • It is opening Chrome again and again without closing it. In every def, just before ```df = pd.DataFrame(player_list, columns = ['Player Name'])``` and ```return df```, write ```browser.close()``` and then try again; you have to close Chrome after opening the page, otherwise it builds up a cache and throws errors. – 1hiv2m Feb 03 '22 at 16:11
  • That gives me ```raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=53423): Max retries exceeded with url: /session/b94e457237a039b01933465f8e43e731/window (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))``` – Divyam Bansal Feb 03 '22 at 16:14
  • Try restarting your machine, or end the Chrome task from Task Manager. – 1hiv2m Feb 03 '22 at 16:16
  • Tried that... Still the same – Divyam Bansal Feb 03 '22 at 16:26
  • It is an issue with your machine; restart it. Meanwhile, put ```driver.close()``` at the end of the code, after the date loop. – 1hiv2m Feb 04 '22 at 14:06
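
A sketch of the pd.read_html approach suggested in the comments (assumptions: the rankings are the first <table> in the rendered page, chrome_options is the one defined in the answer's code, and the helper name is hypothetical):

from io import StringIO

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def get_rankings(url, timeout=10):
    # render the page and wait until a <table> element exists before
    # grabbing the HTML, so the rankings for the requested date are loaded
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        html = browser.page_source
    # read_html parses every <table> into a DataFrame; assume the
    # rankings table is the first one on the page
    return pd.read_html(StringIO(html))[0]

This would replace the manual td/a parsing entirely, so rows without an <a> tag no longer need special handling.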