
I used the following as a guide and, being brand-new to Python, cannot figure out how to implement it for my data set.

How to extract tables from websites in Python

My problem is that I want to convert the following table into CSV format. I thought the solution above would work, but alas, no go.

import requests
import pandas as pd

url = '
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')

Pavlos's suggestion:

import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly.xml'
html = requests.get(url)
connection = html.text
soup = BeautifulSoup(connection, 'lxml')

header = soup.find_all("table")[0].find("tr")
list_header = []
for items in header:
    try:
        list_header.append(items.get_text())
    except:
        continue
        
HTML_data = soup.find_all("table")[0].find_all("tr")[1:]

data = []
for element in HTML_data:
    sub_data = []
    for sub_element in element:
        try:
            sub_data.append(sub_element.get_text())
        except:
            continue
    data.append(sub_data)

dataFrame = pd.DataFrame(data = data, columns = list_header) 

dataFrame.to_csv('Overflow.csv') 
  • Since it's XML you're dealing with, have a look at this answer using `xmltodict` and `requests`: https://stackoverflow.com/a/62004135/42346 – mechanical_meat Mar 13 '21 at 02:07
  • There is no HTML on that page only XML that is re-styled with XSLT to display as HTML. But underlying content is XML. – Parfait Mar 13 '21 at 13:42
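As the comments point out, the page serves raw XML rather than HTML, so an XML parser is the right tool. Below is a minimal, dependency-free sketch using the standard library's `xml.etree.ElementTree`; the sample string is a trimmed, hypothetical fragment that only mimics the IESO document's layout (element names from its schema namespace, values made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment mimicking the IESO report's structure
sample = """\
<Document xmlns="http://www.ieso.ca/schema">
  <DocBody>
    <DailyData>
      <Day>2021-01-01</Day>
      <HourlyData>
        <Hour>1</Hour>
        <FuelTotal>
          <Fuel>NUCLEAR</Fuel>
          <EnergyValue><Output>9021</Output></EnergyValue>
        </FuelTotal>
        <FuelTotal>
          <Fuel>GAS</Fuel>
          <EnergyValue><Output>338</Output></EnergyValue>
        </FuelTotal>
      </HourlyData>
    </DailyData>
  </DocBody>
</Document>
"""

ns = {"ieso": "http://www.ieso.ca/schema"}
root = ET.fromstring(sample)

# Flatten Day/Hour/Fuel/Output into one record per fuel reading
rows = []
for daily in root.iter("{http://www.ieso.ca/schema}DailyData"):
    day = daily.find("ieso:Day", ns).text
    for hourly in daily.findall("ieso:HourlyData", ns):
        hour = hourly.find("ieso:Hour", ns).text
        for fuel in hourly.findall("ieso:FuelTotal", ns):
            rows.append({
                "Day": day,
                "Hour": hour,
                "Fuel": fuel.find("ieso:Fuel", ns).text,
                "Output": fuel.find("ieso:EnergyValue/ieso:Output", ns).text,
            })

print(rows[0])  # {'Day': '2021-01-01', 'Hour': '1', 'Fuel': 'NUCLEAR', 'Output': '9021'}
```

The linked answer takes the same flattening approach with `xmltodict`; against the real feed you would replace `sample` with the response body from `requests.get(url)`.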

2 Answers


First, open the HTML:

import requests
import csv
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly.xml'
html = requests.get(url)
connection = html.text
soup = BeautifulSoup(connection, 'lxml')

Then we need to extract the table headers. To do that, grab the header row and collect each cell's text into a list inside a for loop:

header = soup.find_all("table")[0].find("tr")
list_header = []
for items in header: 
    try: 
        list_header.append(items.get_text()) 
    except: 
        continue  

Then get the data from the table. The first loop iterates over the rows; the second loop iterates over the individual cells inside each row. Start from an empty data list so each row can be appended to it:

HTML_data = soup.find_all("table")[0].find_all("tr")[1:]

data = []
for element in HTML_data:
    sub_data = []
    for sub_element in element:
        try:
            sub_data.append(sub_element.get_text())
        except:
            continue
    data.append(sub_data)

Store it in a pandas DataFrame:

dataFrame = pd.DataFrame(data = data, columns = list_header) 

Now convert the DataFrame to a CSV:

dataFrame.to_csv('Overflow.csv') 
  • Hey Pavlos, I get the error: import csv, requests from bs4 import BeautifulSoup –  Mar 13 '21 at 04:03
  • I am sorry. I forgot to mention that csv needs to be imported same as BeautifulSoup module: import csv, from bs4 import BeautifulSoup at the start of your file should do the trick (include the two importing statements in two separate lines) – Pavlos Rousoglou Mar 13 '21 at 04:05
  • Thank you Pavlos, however I'm still getting the same error. –  Mar 13 '21 at 04:16
  • What are you using to run your code? I am using Anaconda and bs4 is embedded in it. If you are still getting the same errors you should probably try installing the module. The following link contains detailed instructions on how to do that : https://www.geeksforgeeks.org/beautifulsoup-installation-python/. You might have to do a similar thing for the csv package – Pavlos Rousoglou Mar 13 '21 at 04:22
  • Hi Pavlos I'm using Mu, however I am able to use bs4 in other scripts. I'm sorry for the grief. –  Mar 13 '21 at 04:26
  • Try importing requests too. See the solution above – Pavlos Rousoglou Mar 13 '21 at 04:31
  • I'll post the entire script above. take a look if you don't mind. Thanks again. –  Mar 13 '21 at 04:45
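Since the error in this thread turned out to be about imports, a quick way to check from any editor (including Mu) which modules are actually installed is a small stdlib-only probe; the module list here is just the one this script happens to use:

```python
import importlib.util

# Report which required modules are importable in the current environment;
# csv is part of the standard library, so it should always be found.
for mod in ("csv", "requests", "bs4", "pandas"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'ok' if found else 'missing - try pip install ' + mod}")
```

Anything reported missing can typically be installed with `pip install` (note that the bs4 package on PyPI is named `beautifulsoup4`).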

Wait, very soon! Pandas 1.3 will introduce a new feature to read and write shallow XML documents: pandas.read_xml (and even DataFrame.to_xml). Currently, your link contains entirely XML content, with an XSLT script that styles it as HTML in the browser. See the page source (Ctrl/Cmd + U).

Pandas 1.3+

The following uses a simplified, non-HTML XSLT script:

import pandas as pd

url = "http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly"

xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:ieso="http://www.ieso.ca/schema">
   <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match="/ieso:Document">
     <data>
       <xsl:apply-templates select="descendant::ieso:HourlyData"/>
     </data>
   </xsl:template>
    
   <xsl:template match="ieso:HourlyData">
     <row>
       <Day><xsl:value-of select="ancestor::ieso:DailyData/ieso:Day"/></Day>
       <Hour><xsl:value-of select="ieso:Hour"/></Hour>
       <NUCLEAR><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'NUCLEAR']/ieso:EnergyValue/ieso:Output"/></NUCLEAR>
       <GAS><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'GAS']/ieso:EnergyValue/ieso:Output"/></GAS>
       <HYDRO><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'HYDRO']/ieso:EnergyValue/ieso:Output"/></HYDRO>
       <WIND><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'WIND']/ieso:EnergyValue/ieso:Output"/></WIND>
       <SOLAR><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'SOLAR']/ieso:EnergyValue/ieso:Output"/></SOLAR>
       <BIOFUEL><xsl:value-of select="ieso:FuelTotal[ieso:Fuel = 'BIOFUEL']/ieso:EnergyValue/ieso:Output"/></BIOFUEL>
       <Total_Output><xsl:value-of select="sum(.//ieso:EnergyValue/ieso:Output)"/></Total_Output>
     </row>
   </xsl:template>
   
</xsl:stylesheet>
"""

fuelhour_df = pd.read_xml(url, stylesheet=xsl)

fuelhour_df.to_csv("IESO_Fuel_Hour_Data.csv")

print(fuelhour_df)

#              Day  Hour  NUCLEAR  GAS  HYDRO  WIND  SOLAR  BIOFUEL  Total_Output
# 0     2021-01-01     1     9021  338   4781   161      0        0         14301
# 1     2021-01-01     2     9037  245   4363   141      0        0         13786
# 2     2021-01-01     3     9027  214   4028   238      0        0         13507
# 3     2021-01-01     4     9182  214   3681   428      0        0         13505
# 4     2021-01-01     5     9219  213   3172   277      0        0         12881
# ...          ...   ...      ...  ...    ...   ...    ...      ...           ...
# 1699  2021-03-12    20     8377  381   5441  3724      0       32         17955
# 1700  2021-03-12    21     8374  245   5118  3596      0       31         17364
# 1701  2021-03-12    22     8374  247   4579  3379      0       31         16610
# 1702  2021-03-12    23     8374  209   4457  3041      0       31         16112
# 1703  2021-03-12    24     8376  208   4219  2476      0       15         15294

Pandas < 1.3

For now, consider running the XSLT transformation locally with the third-party package lxml (the default parser in read_xml), then calling read_html:

import lxml.etree as lx
import pandas as pd

url = "http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly"
xsl = "http://reports.ieso.ca/docrefs/stylesheet/GenOutputbyFuelHourly_HTML_t1-1.xsl"

doc = lx.parse(url)
style = lx.parse(xsl)

transformer = lx.XSLT(style)
html = transformer(doc)

fuelhour_df = pd.read_html(bytes(html), header=0)[2]

print(fuelhour_df)
#             Date  Hour  NUCLEAR  GAS  HYDRO  WIND  SOLAR  BIOFUEL  Total Output
# 0     2021-01-01     1     9021  338   4781   161      0        0         14301
# 1     2021-01-01     2     9037  245   4363   141      0        0         13786
# 2     2021-01-01     3     9027  214   4028   238      0        0         13507
# 3     2021-01-01     4     9182  214   3681   428      0        0         13505
# 4     2021-01-01     5     9219  213   3172   277      0        0         12881
# ...          ...   ...      ...  ...    ...   ...    ...      ...           ...
# 1699  2021-03-12    20     8377  381   5441  3724      0       32         17955
# 1700  2021-03-12    21     8374  245   5118  3596      0       31         17364
# 1701  2021-03-12    22     8374  247   4579  3379      0       31         16610
# 1702  2021-03-12    23     8374  209   4457  3041      0       31         16112
# 1703  2021-03-12    24     8376  208   4219  2476      0       15         15294

fuelhour_df.to_csv("IESO_Fuel_Hour_Data.csv")
Parfait
  • I love the solution, very clean and easy. Do you know how I can add an append feature to it? As in the xml file updates daily and instead of replacing the file it just appends the new data? –  Mar 13 '21 at 19:00
  • Why not rerun this code everyday for latest data? – Parfait Mar 14 '21 at 04:16
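If rerunning the script daily, the overwrite-vs-append question from the comment can be handled with to_csv's append mode. A sketch under assumptions (the file name and the new_df stand-in are hypothetical; in practice new_df would be the freshly downloaded frame):

```python
import os
import pandas as pd

csv_path = "ieso_daily_sample.csv"  # hypothetical file name

# Hypothetical stand-in for the freshly downloaded data
new_df = pd.DataFrame({"Day": ["2021-03-13"], "Hour": [1], "NUCLEAR": [8400]})

if os.path.exists(csv_path):
    # Append only Days not already on disk, without rewriting the header
    old_days = set(pd.read_csv(csv_path)["Day"])
    fresh = new_df[~new_df["Day"].isin(old_days)]
    fresh.to_csv(csv_path, mode="a", header=False, index=False)
else:
    # First run: create the file with a header row
    new_df.to_csv(csv_path, index=False)
```

This keeps one growing CSV while skipping days that were already written; it assumes the column layout never changes between runs.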