Wait very soon! Pandas 1.3 will introduce a new feature to read and write shallow XML documents: pandas.read_xml
(even DataFrame.to_xml
). Currently, your link contains entirely XML content with XSLT script to style it to HTML by the browser. See source (Ctrl/Cmd + U).
Pandas 1.3+
Using simplified non-HTML XSLT script. Online Demo
import pandas as pd
url = "http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly"
xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ieso="http://www.ieso.ca/schema">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/ieso:Document">
<data>
<xsl:apply-templates select="descendant::ieso:HourlyData"/>
</data>
</xsl:template>
<xsl:template match="ieso:HourlyData">
<row>
<Day><xsl:value-of select="ancestor::ieso:DailyData/ieso:Day"/></Day>
<Hour><xsl:value-of select="ieso:Hour"/></Hour>
<NUCLEAR><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'NUCLEAR']/ieso:EnergyValue/ieso:Output"/></NUCLEAR>
<GAS><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'GAS']/ieso:EnergyValue/ieso:Output"/></GAS>
<HYDRO><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'HYDRO']/ieso:EnergyValue/ieso:Output"/></HYDRO>
<WIND><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'WIND']/ieso:EnergyValue/ieso:Output"/></WIND>
<SOLAR><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'SOLAR']/ieso:EnergyValue/ieso:Output"/></SOLAR>
<BIOFUEL><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'BIOFUEL']/ieso:EnergyValue/ieso:Output"/></BIOFUEL>
<Total_Output><xsl:value-of select="sum(.//ieso:EnergyValue/ieso:Output)"/></Total_Output>
</row>
</xsl:template>
</xsl:stylesheet>
"""
fuelhour_df = pd.read_xml(url, stylesheet=xsl)
fuelhour_df.to_csv("IESO_Fuel_Hour_Data.csv")
print(fuelhour_df)
# Day Hour NUCLEAR GAS HYDRO WIND SOLAR BIOFUEL Total_Output
# 0 2021-01-01 1 9021 338 4781 161 0 0 14301
# 1 2021-01-01 2 9037 245 4363 141 0 0 13786
# 2 2021-01-01 3 9027 214 4028 238 0 0 13507
# 3 2021-01-01 4 9182 214 3681 428 0 0 13505
# 4 2021-01-01 5 9219 213 3172 277 0 0 12881
# ... ... ... ... ... ... ... ... ... ...
# 1699 2021-03-12 20 8377 381 5441 3724 0 32 17955
# 1700 2021-03-12 21 8374 245 5118 3596 0 31 17364
# 1701 2021-03-12 22 8374 247 4579 3379 0 31 16610
# 1702 2021-03-12 23 8374 209 4457 3041 0 31 16112
# 1703 2021-03-12 24 8376 208 4219 2476 0 15 15294
Pandas < 1.3
For now consider, running the XSLT transformation locally with third-party package, lxml
(default parser in read_xml
), then call read_html
:
import lxml.etree as lx
import pandas as pd
url = "http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly"
xsl = "http://reports.ieso.ca/docrefs/stylesheet/GenOutputbyFuelHourly_HTML_t1-1.xsl"
doc = lx.parse(url)
style = lx.parse(xsl)
transformer = lx.XSLT(style)
html = transformer(doc)
fuelhour_df = pd.read_html(bytes(html), header=0)[2]
print(fuelhour_df)
# Date Hour NUCLEAR GAS HYDRO WIND SOLAR BIOFUEL Total Output
# 0 2021-01-01 1 9021 338 4781 161 0 0 14301
# 1 2021-01-01 2 9037 245 4363 141 0 0 13786
# 2 2021-01-01 3 9027 214 4028 238 0 0 13507
# 3 2021-01-01 4 9182 214 3681 428 0 0 13505
# 4 2021-01-01 5 9219 213 3172 277 0 0 12881
# ... ... ... ... ... ... ... ... ... ...
# 1699 2021-03-12 20 8377 381 5441 3724 0 32 17955
# 1700 2021-03-12 21 8374 245 5118 3596 0 31 17364
# 1701 2021-03-12 22 8374 247 4579 3379 0 31 16610
# 1702 2021-03-12 23 8374 209 4457 3041 0 31 16112
# 1703 2021-03-12 24 8376 208 4219 2476 0 15 15294
fuelhour_df.to_csv("IESO_Fuel_Hour_Data.csv")