1

I am trying to read keyStat in MorningStar and know the data which is HTML where is warped in a JSON. So far I can put a request that can get the json by Beautifulsoup:

url = 'http://financials.morningstar.com/ajax/keystatsAjax.html?t=tou&culture=en-CA&region=CAN'
lm_json = requests.get(url).json()
ksContent = BeautifulSoup(lm_json["ksContent"],"html.parser")

Now here is a bit wired to me that the html data as 'ksContent' which contains actual data as a table. I am not a fan of html and wondering how can I just make all it to a nice pandas dataframe? As the table is long, here is some of it:

     <table cellpadding="0" cellspacing="0" class="r_table1 text2">
     <colgroup>
        <col width="23%"/>
        <col span="11" width="7%"/>
     </colgroup>
     <thead>
        <tr>
           <th align="left" scope="row"></th>
           <th align="right" id="Y0" scope="col">2008-12</th>
           <th align="right" id="Y1" scope="col">2009-12</th>
           <th align="right" id="Y2" scope="col">2010-12</th>
           <th align="right" id="Y3" scope="col">2011-12</th>
           <th align="right" id="Y4" scope="col">2012-12</th>
           <th align="right" id="Y5" scope="col">2013-12</th>
           <th align="right" id="Y6" scope="col">2014-12</th>
           <th align="right" id="Y7" scope="col">2015-12</th>
           <th align="right" id="Y8" scope="col">2016-12</th>
           <th align="right" id="Y9" scope="col">2017-12</th>
           <th align="right" id="Y10" scope="col">TTM</th>
        </tr>
     </thead>
     <tbody>
        <tr class="hr">
           <td colspan="12"></td>
        </tr>
        <tr>
           <th class="row_lbl" id="i0" scope="row">Revenue <span>CAD Mil</span></th>
           <td align="right" headers="Y0 i0">—</td>
           <td align="right" headers="Y1 i0">40</td>
           <td align="right" headers="Y2 i0">212</td>
           <td align="right" headers="Y3 i0">349</td>
           <td align="right" headers="Y4 i0">442</td>
           <td align="right" headers="Y5 i0">759</td>
           <td align="right" headers="Y6 i0">1,379</td>
           <td align="right" headers="Y7 i0">1,074</td>
           <td align="right" headers="Y8 i0">1,125</td>
           <td align="right" headers="Y9 i0">1,662</td>
           <td align="right" headers="Y10 i0">1,760</td>
        </tr> ...

It defines a header tr, Y0, Y1 ... Y10 as actual date and next tr refers to it.

your help appreciated!

Colin Zhong
  • 727
  • 1
  • 7
  • 16
  • `df_list = pd.read_html(ksContent.prettify())` (after importing pandas as pd) will give you a list of 8 dataframes. I didn't see the original table so I'm not sure what it's supposed to look like but you can pick through those e.g. `df_list[0].head()` and see if one or a few of them are what you're looking for. – Declan Jan 19 '19 at 07:01

1 Answers1

0

You can use read_html() to convert it into a list of dataframes

import requests
import pandas as pd
url = 'http://financials.morningstar.com/ajax/keystatsAjax.html?t=tou&culture=en-CA&region=CAN'
lm_json = requests.get(url).json()
df_list=pd.read_html(lm_json["ksContent"])

You can iterate through it and get the dataframes one by one. You can also use dropna() to get rid of the NaN only rows.

Sample output screenshot from my jupyter Notebook

enter image description here

Bitto
  • 7,937
  • 1
  • 16
  • 38