I'm scraping a table that displays info for a sporting league. So far so good for a Selenium beginner:
from selenium import webdriver
import re
import pandas as pd
driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []
for i in infotable:
    ilist.append(i.text)
infolist = ilist[0]
for i in matches:
    match.append(i.text)
driver.close()
home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])
df = pd.DataFrame({'home' : home, 'away' : away})
date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)
In the last line, date scrapes all the dates in the table but I can't link them to the corresponding game. My thinking is: for child/element "under the date", date = last_found_date.
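To make the "last found date" idea concrete, here is a minimal sketch in plain Python. It assumes the table text arrives one row per line, that date header rows begin with a date such as 15 Jun 2015, and that match rows contain ' - ' between the team names. The helper name attach_dates and the sample rows are made up for illustration; the real page layout may differ.

```python
import re

# Same date pattern as in the question, compiled once
DATE_RE = re.compile(r"\d\d\s\w\w\w\s\d\d\d\d")

def attach_dates(lines):
    """Pair each match row with the most recent date header seen above it."""
    last_found_date = None
    paired = []
    for line in lines:
        row = line.strip()
        found = DATE_RE.match(row)
        if found:
            # A date header row: remember it for the rows that follow
            last_found_date = found.group(0)
        elif " - " in row:
            # A match row (assumed to contain "Home - Away")
            paired.append((last_found_date, row))
    return paired

# Hypothetical sample rows, not copied from the real page
sample = [
    "15 Jun 2015 - Play Offs",
    "02:00 Chicago Blackhawks - Tampa Bay Lightning 2:0",
    "13 Jun 2015 - Play Offs",
    "02:00 Tampa Bay Lightning - Chicago Blackhawks 1:2",
]

for date, row in attach_dates(sample):
    print(date, "|", row)
```

If i.text really does put each table row on its own line (an assumption I haven't verified), this could be called as attach_dates(infolist.split('\n')).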
The ultimate goal is to have two more columns in df: one with the date of the match, and the next with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).
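For the second column, a rough sketch of splitting a date header into the date and whatever text sits beside it. It assumes the header looks like '15 Jun 2015 - Play Offs'; the helper name split_header and the sample strings are hypothetical, and any extra column labels on the real header row would end up in the note as well.

```python
import re

# Date at the start of the header, then whatever follows it
HEADER_RE = re.compile(r"(\d\d\s\w\w\w\s\d\d\d\d)(.*)")

def split_header(header):
    """Return (date, note) for a date header row, or (None, '') if no date is found."""
    found = HEADER_RE.match(header.strip())
    if not found:
        return None, ""
    date, rest = found.groups()
    # Drop the separator and surrounding spaces around the note
    return date, rest.strip(" -")

print(split_header("15 Jun 2015 - Play Offs"))
print(split_header("20 Mar 2015"))
```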
Should I be incorporating another program/method to retain order of tags/elements of the table?