
I am trying to learn how to parse HTML data, so I've chosen this website (http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet), which has real-time data for electricity prices.

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import pandas as pd

    url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")


    print(soup.get_text())

My question is how do I further parse / filter a BS4 object once it's created?

I just want the data that updates so I can throw it into a pandas dataframe.

Like so:

    Date (HE)     | Time  | Price ($) | Volume (MW)
    01/15/2021 14 | 13:10 | 40.16     | 80
    01/15/2021 14 | 13:05 | 40.18     | 100
    01/15/2021 14 | 13:00 | 40.16     | 80

    Depending on the format of the page, you can try pandas.read_html(), or if that doesn't work well on that site, you can use BeautifulSoup's find/find_all. – goalie1998 Jan 15 '21 at 20:52

1 Answer


As goalie1998 mentioned, the easiest way is to use pandas: just two lines of code.

Example

Notice that pandas' read_html() stores all tables on the page in a list of data frames. Because the table you want is the third one in the source, you can select it directly by indexing the list with [2].

    import pandas as pd
    pd.read_html('http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet')[2]

Output

           Date (HE)   Time  Price ($)  Volume (MW)
    0  01/15/2021 14  13:10      40.16           80
    1  01/15/2021 14  13:05      40.18          100
    2  01/15/2021 14  13:00      40.16           80
    3  01/15/2021 13  12:40      40.16           80
    4  01/15/2021 13  12:00      40.01          100
    5  01/15/2021 12  11:54      40.01          100
    6  01/15/2021 12  11:00      40.18          100
    7  01/15/2021 11  10:54      40.18          100
    8  01/15/2021 11  10:24      40.16           80
    9  01/15/2021 11  10:00      40.18          100
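As a quick, self-contained illustration of that list behavior (toy inline HTML here, not the AESO page): read_html returns one data frame per `<table>` it finds, and you pick the one you want by index.

    from io import StringIO
    import pandas as pd

    # Toy HTML with two tables (hypothetical data, not the AESO page)
    html = """
    <table><tr><th>A</th></tr><tr><td>1</td></tr></table>
    <table>
      <tr><th>Price ($)</th><th>Volume (MW)</th></tr>
      <tr><td>40.16</td><td>80</td></tr>
    </table>
    """

    tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
    print(len(tables))                     # 2 tables found
    print(tables[1].columns.tolist())      # ['Price ($)', 'Volume (MW)']

On the real page you would inspect `len(tables)` and each frame's columns the same way to find the right index.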

Just in case you really have to go with BeautifulSoup:

  1. Select the third table, i.e. the table without a class, on the site
  2. Select all the tr elements in the table, loop over them, and append the texts from the th/td elements to the data list
  3. Import the data list into pandas

Example

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import pandas as pd

    url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    data = []
    # the data table is the only <table> on the page without a class attribute
    rows = soup.select('table:not([class]) tr')
    for i, row in enumerate(rows):
        if i == 0:
            cols = row.find_all('th')  # first row holds the headers
        else:
            cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # drop empty cells

    pd.DataFrame(data[1:], columns=data[0])
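One caveat with the BeautifulSoup route: every cell comes back from .text.strip() as a string, whereas read_html infers numeric dtypes for you. A minimal sketch (hypothetical rows matching the table's layout) of converting the numeric columns afterwards with pd.to_numeric:

    import pandas as pd

    # rows as scraped via .text.strip(): everything is still a str
    data = [["01/15/2021 14", "13:10", "40.16", "80"],
            ["01/15/2021 14", "13:05", "40.18", "100"]]
    df = pd.DataFrame(data, columns=["Date (HE)", "Time", "Price ($)", "Volume (MW)"])

    # convert the numeric columns so aggregation/arithmetic works
    for col in ["Price ($)", "Volume (MW)"]:
        df[col] = pd.to_numeric(df[col])

    print(df.dtypes["Price ($)"])   # float64
    print(df["Volume (MW)"].sum())  # 180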
HedgeHog