As goalie1998 mentioned, the easiest way is to use pandas: just two lines of code.
Example
Notice that pandas read_html()
stores all the tables on the page in a list of data frames. Because the table you want is the third one in the source, you can select it directly by indexing the list with [2]
import pandas as pd
pd.read_html('http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet')[2]
Output
Date (HE) Time Price ($) Volume (MW)
0 01/15/2021 14 13:10 40.16 80
1 01/15/2021 14 13:05 40.18 100
2 01/15/2021 14 13:00 40.16 80
3 01/15/2021 13 12:40 40.16 80
4 01/15/2021 13 12:00 40.01 100
5 01/15/2021 12 11:54 40.01 100
6 01/15/2021 12 11:00 40.18 100
7 01/15/2021 11 10:54 40.18 100
8 01/15/2021 11 10:24 40.16 80
9 01/15/2021 11 10:00 40.18 100
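Side note: hardcoding [2] breaks if the page layout ever changes. read_html() also takes a match parameter that keeps only tables whose text matches a pattern, so you can select by a column header instead of by position. A minimal sketch, using inline HTML to stand in for the live page:

```python
from io import StringIO
import pandas as pd

# Two tables; only the second contains the header we care about.
html = """
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Date (HE)</th><th>Time</th><th>Price ($)</th><th>Volume (MW)</th></tr>
  <tr><td>01/15/2021 14</td><td>13:10</td><td>40.16</td><td>80</td></tr>
</table>
"""

# match filters the returned list down to tables containing "Volume",
# so the result no longer depends on the table's position on the page.
df = pd.read_html(StringIO(html), match="Volume")[0]
print(df.columns.tolist())  # ['Date (HE)', 'Time', 'Price ($)', 'Volume (MW)']
```

The same call works against the URL directly, e.g. pd.read_html(url, match="Volume")[0].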
Just in case you really have to go with BeautifulSoup:

- Select the third table, or the table without a class attribute, on the site
- Select all the tr in the table, loop over them, and append the text of each th/td cell to the data list
- Import the data list into pandas
Example
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

data = []
# The target table is the only one on the page without a class attribute
rows = soup.select('table:not([class]) tr')

for i, row in enumerate(rows):
    # The first row holds the headers in <th>, the rest hold data in <td>
    cols = row.find_all('th') if i == 0 else row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

pd.DataFrame(data[1:], columns=data[0])