
I'm trying to pull data from the table called "Fuel Mix Graph" on this site: https://www.iso-ne.com/isoexpress/. I am using BeautifulSoup to parse the HTML and extract the table shown below, but when I try to read the contents of tbody, it comes back empty.

Here is my code:

from bs4 import BeautifulSoup
from urllib.request import urlopen


pullPage = 'https://www.iso-ne.com/isoexpress/'

#query website and assign HTML to var page
page = urlopen(pullPage)

#parse HTML into var soup
soup = BeautifulSoup(page, 'html.parser')

#find the portlet <div> by its id, then the table inside it
fuelMix = soup.find('div', id='p_p_id_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_')
fuelMixData = fuelMix.find('table', id='_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_table')

#find_all returns a list of every <tbody> inside the table
tbody = fuelMixData.find_all('tbody')
#for row in rows:
#    data = row.find_all('td')
#    FMData.append(str(row.find_all('tr')[0].text))

print(tbody)

and here is the relevant section of the HTML:

<table id="_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_table" align="left"> 
     <thead> 
          <tr> 
               <th style="text-align:left;">Date/Time</th>
               <th style="text-align:left;">Fuel</th>
               <th>MW</th> </tr> 
     </thead> 
     <tbody>
          <tr>
               <td style="text-align:left;">06/02/2019 00:01</td>
               <td style="text-align:left;">NaturalGas</td>
               <td>2581</td>
          </tr>
          <tr>
               <td style="text-align:left;">06/02/2019 00:01</td>
               <td style="text-align:left;">Nuclear</td>
               <td>3339</td>
          </tr>
     </tbody> 
</table>

For now, I simply expect to print all of the data in tbody. Eventually I will read the 'tr' and 'td' elements to build arrays of the data. (Any ideas on how to strip out the strings that are not the date/time, fuel type, or value would be appreciated as well!)

When I run the current code, it returns only:

[<tbody></tbody>]

If I call find_all('tr'), it only returns the row from thead:

[<tr> <th style="text-align:left;">Date/Time</th> <th style="text-align:left;">Fuel</th> <th>MW</th> </tr>]

And if I call find_all('td'), an empty list is returned.

Thank you for your help in advance.

  • Have you printed the result? I mean, was the table you shared here actually pulled by your code, or did you copy it from the website? The problem with BeautifulSoup is that sometimes the content you see in the browser isn't in the HTML it receives, since BS does not run any JavaScript on the page – Jose Angel Sanchez Jun 02 '19 at 22:49
  • The printed results are the ones I pasted at the end of the post. The chunk of HTML I'm trying to read was copied directly from the site. If BS doesn't run the JS on the page, do I need to work with JSON, as QHarr suggests? – Muntasir Shahabuddin Jun 03 '19 at 18:35
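
To illustrate the point made in the comments, here is a minimal sketch that parses a static copy of the table markup from the question. The shortened table id `fuel_table` and the inline HTML string are assumptions made for illustration; they are not what the live site serves. When the rows are actually present in the markup, the usual `find_all` calls return them, which suggests the empty `tbody` in the question comes from the rows being filled in by JavaScript after the page loads.

```python
from bs4 import BeautifulSoup

# Static copy of the table from the question, with a shortened,
# hypothetical id. The rows are already present in this markup.
html = """
<table id="fuel_table">
  <thead><tr><th>Date/Time</th><th>Fuel</th><th>MW</th></tr></thead>
  <tbody>
    <tr><td>06/02/2019 00:01</td><td>NaturalGas</td><td>2581</td></tr>
    <tr><td>06/02/2019 00:01</td><td>Nuclear</td><td>3339</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="fuel_table")

# Collect the text of every <td> in each body row as a list of strings.
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

print(rows)
```

Each entry in `rows` is a three-item list (date/time, fuel, MW as strings), so the header text and other page strings never enter the result.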

1 Answer

Mimic the POST request the page makes and you get all of that data in JSON format:

import requests
import time

# Form fields sent by the page's own POST request
params = {
    '_nstmp_formDate': int(time.time()),
    '_nstmp_startDate': '06/02/2019',
    '_nstmp_endDate': '06/02/2019',
    '_nstmp_twodays': 'false',
    '_nstmp_chartTitle': 'Fuel Mix Graph',
    '_nstmp_requestType': 'genfuelmix',
    '_nstmp_fuelType': 'all',
    '_nstmp_height': 250,
    '_nstmp_showtwodays': 'false'
}
r = requests.post('https://www.iso-ne.com/ws/wsclient', data=params).json()

Writing the result out to a pandas DataFrame, for example:

import requests
import time
import pandas as pd

# Form fields sent by the page's own POST request
params = {
    '_nstmp_formDate': int(time.time()),
    '_nstmp_startDate': '06/02/2019',
    '_nstmp_endDate': '06/02/2019',
    '_nstmp_twodays': 'false',
    '_nstmp_chartTitle': 'Fuel Mix Graph',
    '_nstmp_requestType': 'genfuelmix',
    '_nstmp_fuelType': 'all',
    '_nstmp_height': 250,
    '_nstmp_showtwodays': 'false'
}

r = requests.post('https://www.iso-ne.com/ws/wsclient', data=params).json()
result = []
headers = ['NaturalGas', 'Wind', 'Nuclear', 'Solar', 'Wood', 'Refuse', 'LandfillGas', 'BeginDateMs', 'Renewables', 'BeginDate', 'Hydro', 'Other']

for item in r[0]['data']:
    row = {}
    for header in headers:
        # Use an empty string when a key is missing from this item
        row[header] = item.get(header, '')
    # Append once per item, after all headers have been filled in
    result.append(row)
df = pd.DataFrame(result, columns=headers)
print(df.head())
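
If the goal is the same three-column layout as the table in the question (date/time, fuel, MW), one possible follow-up is to flatten each item into one row per fuel. This is only a sketch: `to_long_rows`, `NON_FUEL_KEYS`, and the `sample` values are hypothetical and made up for illustration, and it assumes each item is a dict keyed like the `headers` list above.

```python
# Hypothetical sample shaped like the items in r[0]['data'];
# the values and the date format are made up for illustration.
sample = [
    {"BeginDate": "06/02/2019 00:01", "BeginDateMs": 0,
     "NaturalGas": 2581, "Nuclear": 3339},
    {"BeginDate": "06/02/2019 00:06", "BeginDateMs": 0,
     "NaturalGas": 2590, "Nuclear": 3340, "Wind": 120},
]

# Keys that describe the timestamp rather than a fuel type.
NON_FUEL_KEYS = {"BeginDate", "BeginDateMs"}


def to_long_rows(items):
    """Return one (date, fuel, mw) tuple per fuel value in each item."""
    rows = []
    for item in items:
        date = item.get("BeginDate", "")
        for key, value in item.items():
            if key in NON_FUEL_KEYS:
                continue
            rows.append((date, key, value))
    return rows


long_rows = to_long_rows(sample)
for row in long_rows:
    print(row)
```

With real data, `to_long_rows(r[0]['data'])` would take the place of the sample call, assuming the response has the shape used in the loop above.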
QHarr