
I am gathering sports fixtures and results from a webpage. I originally planned to scrape with Pandas, but the page has a "timezone" dropdown, so I added Selenium to select the timezone automatically. Now I don't know how to scrape with Pandas after using Selenium. Could somebody please do me a favour? Thank you very much.

here is my work:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pandas as pd

PATH = "C:/Users/XXX/Desktop/chromedriver.exe"
driver = webdriver.Chrome(PATH)

driver.get("https://fixturedownload.com")

select = Select(driver.find_element_by_name("timezone"))

select.select_by_value("SE Asia Standard Time")

driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div/input[1]').click()

List = pd.read_html(I am stuck here)
Kakakarsa
  • You do not scrape with pandas; you scrape with Selenium and then put the data into pandas DataFrames/tables. Since you are trying to read HTML, pandas alone is not going to help you much; use Beautiful Soup 4 to read the HTML table and then store it in a pandas DataFrame. – Anurag Dhadse Dec 13 '21 at 01:50
  • Well, that's not quite true. Pandas certainly CAN do web scraping, although it only works with rigidly structured web pages. However, pandas expects to work with whole web pages, and Selenium works with the object model. If you can find the id of a table that contains your data, you can fetch the `.text` of that table and pass it to `read_html`. – Tim Roberts Dec 13 '21 at 01:59
  • What values did you want to read into the pandas what is the expected output? – Arundeep Chohan Dec 13 '21 at 03:01
  • Read "match table for fixture and result" into the pandas after selecting "Timezone" and output I think is DataFrame. – Kakakarsa Dec 13 '21 at 03:12
  • Can you highlight it in a screenshot. Or have an example output not sure which one you want. – Arundeep Chohan Dec 13 '21 at 04:00
  • @ArundeepChohan OK, I will try later. – Kakakarsa Dec 13 '21 at 04:20
  • It looks like this https://upload.cc/i1/2021/12/13/esMg6C.jpg – Kakakarsa Dec 13 '21 at 13:46

2 Answers


You don't need Selenium. Issue a POST request to the server with your desired timezone (provided it appears in the dropdown list).

The available values to use appear against the value attribute of the option tags within the parent select element:


Then parse the response to extract your desired download-format links, e.g. you can grab the header-row links for the CSV downloads of all fixtures within each table as follows:

import requests
# import pandas as pd
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Safari/537.36'}

data = {
  'timezone': 'Nepal Standard Time',
  'command': 'Set Timezone'
}

r = requests.post('https://fixturedownload.com/', headers=headers,  data=data)
soup = bs(r.content, 'lxml')
csv_links = ['https://fixturedownload.com' + i['href'] for i in soup.select('.fixture tr:nth-child(1) td:nth-child(3) a')]
print(csv_links)

You can then combine the CSVs if their headers match, or simply download, store, and manipulate them as needed.

There is no point using read_html as you will lose the links to the actual data.
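Combining CSVs with matching headers is a one-liner with `pd.concat`. A minimal sketch, using made-up inline CSV text in place of the real downloads (in practice each string would come from `requests.get(link).text` for a link in `csv_links`, and the column names here are illustrative, not the site's actual schema):

```python
from io import StringIO

import pandas as pd

# Stand-in CSV text for two downloaded fixture files.
csv_a = "Round,Home,Away\n1,Carlton,Richmond\n"
csv_b = "Round,Home,Away\n1,Essendon,Hawthorn\n"

# Parse each CSV, then stack them into a single DataFrame,
# renumbering the index so rows don't collide.
frames = [pd.read_csv(StringIO(text)) for text in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```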

QHarr

To select the timezone as SE Asia Standard Time and scrape the TABLE using Pandas you can use the following Locator Strategies:

Code Block:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://fixturedownload.com/")
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//select[@name='timezone']")))).select_by_value("SE Asia Standard Time")
driver.find_element(By.XPATH, "//input[@value='Set Timezone']").click()
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='fixture']"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)

Console Output:

[                    0                1  ...                          4          5
0        Full fixture  Preview fixture  ...  Download fixture for ICAL  View JSON
1               Teams            Teams  ...                      Teams        NaN
2      Adelaide Crows  Preview fixture  ...  Download fixture for ICAL  View JSON
3      Brisbane Lions  Preview fixture  ...  Download fixture for ICAL  View JSON
4             Carlton  Preview fixture  ...  Download fixture for ICAL  View JSON
5         Collingwood  Preview fixture  ...  Download fixture for ICAL  View JSON
6            Essendon  Preview fixture  ...  Download fixture for ICAL  View JSON
7           Fremantle  Preview fixture  ...  Download fixture for ICAL  View JSON
8        Geelong Cats  Preview fixture  ...  Download fixture for ICAL  View JSON
9     Gold Coast Suns  Preview fixture  ...  Download fixture for ICAL  View JSON
10         GWS Giants  Preview fixture  ...  Download fixture for ICAL  View JSON
11           Hawthorn  Preview fixture  ...  Download fixture for ICAL  View JSON
12          Melbourne  Preview fixture  ...  Download fixture for ICAL  View JSON
13    North Melbourne  Preview fixture  ...  Download fixture for ICAL  View JSON
14      Port Adelaide  Preview fixture  ...  Download fixture for ICAL  View JSON
15           Richmond  Preview fixture  ...  Download fixture for ICAL  View JSON
16           St Kilda  Preview fixture  ...  Download fixture for ICAL  View JSON
17       Sydney Swans  Preview fixture  ...  Download fixture for ICAL  View JSON
18  West Coast Eagles  Preview fixture  ...  Download fixture for ICAL  View JSON
19   Western Bulldogs  Preview fixture  ...  Download fixture for ICAL  View JSON

[20 rows x 6 columns]]
undetected Selenium
  • Thank you very much, it works, I appreciate it. https://upload.cc/i1/2021/12/13/esMg6C.jpg WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='fixture']"))).get_attribute("outerHTML") May I ask what the "20" means and what "outerHTML" stands for? – Kakakarsa Dec 13 '21 at 13:53
  • The _`20`_ stands for the number of seconds [Selenium](https://stackoverflow.com/questions/54459701/what-is-selenium-and-what-is-webdriver/54482491#54482491) should wait before it attempts to extract the `outerHTML`. This _**seconds**_ is configurable and you can change it as per your requirements e.g. `WebDriverWait(driver, 5)` or `WebDriverWait(driver, 7)` or `WebDriverWait(driver, 10)` – undetected Selenium Dec 13 '21 at 15:08
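As for `outerHTML`: it is the element's markup including its own opening and closing tags, so grabbing it from the `<table>` element hands `pd.read_html` a complete, parseable table. A minimal sketch with a made-up one-row table standing in for the site's fixture table:

```python
from io import StringIO

import pandas as pd

# outerHTML of a table element includes the <table>...</table> tags
# themselves, which is exactly the fragment pd.read_html can parse.
outer_html = "<table><tr><th>Team</th></tr><tr><td>Carlton</td></tr></table>"

df = pd.read_html(StringIO(outer_html))[0]
print(df)
```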