Parsing a table from a website using python

Question

I have tried to use requests and BeautifulSoup to parse the Human Development Index (HDI) table from this website: http://hdr.undp.org/en/indicators/137506# By inspecting the page, I found this markup for the table:

<div id="indcontent">
<table id="table"><thead><tr><th style="width:auto;">HDI Rank</th><th style="width:auto;">Country</th><th style="width:auto;">1990</th><th style="width:auto;"></th><th style="width:auto;">1991</th><th 
.
.
 style="width:auto;"></th><th style="width:auto;">2016</th><th style="width:auto;"></th><th style="width:auto;">2017</th><th style="width:auto;"></th><th style="width:auto;">2018</th><th style="width:auto;"></th></tr><tr><th class="indName">Human Development Index (HDI)
null
Dimension: Composite indices
Definition: A composite index measuring average achievement in three basic dimensions of human development—a long and healthy life, knowledge and a decent standard of living. See Technical note 1 at http://hdr.undp.org/sites/default/files/hdr2019_technical_notes.pdf for details on how the HDI is calculated.
Source: HDRO calculations based on data from UNDESA (2019b), UNESCO Institute for Statistics (2019), United Nations Statistics Division (2019b), World Bank (2019a), Barro and Lee (2018) and IMF (2019).</th></tr></thead><tbody><tr class="row-even"><td>170</td><td><img src="/sites/default/files/Country-Profiles/AFG.GIF" style="width:20px; height:auto;"> <a href="/countries/profiles/AFG">Afghanistan</a></td><td>0.298</td><td></td><td>0.304</td><td></td><td>0.312</td><td></td><td>0.308</td><td></td><td>0.303</td><td></td><td>0.327</td><td></td><td>0.331</td><td></td><td>0.335</td><td></td>
.
.
<td>0.339</td><td></td><td>0.343</td><td></td><td>0.345</td><td></td><td>0.347</td><td></td><td>0.378</td><td></td><td>0.387</td><td></td><td>0.400</td><td></td><td>0.410</td><td></td><td>0.419</td><td></td><td>0.431</td><td></td><td>0.436</td><td></td><td>0.447</td><td></td><td>0.464</td><td></td><td>0.465</td><td></td><td>0.479</td><td></td><td>0.485</td><td></td><td>0.708</td><td></td><td>0.713</td><td></td><td>0.718</td><td></td><td>0.722</td><td></td><td>0.727</td><td></td><td>0.729</td><td></td><td>0.731</td><td></td></tr><tr><td class="footnotestable"></td></tr></tbody><tfoot></tfoot></table></div>

Whenever I run my code

from bs4 import BeautifulSoup
import requests

url="http://hdr.undp.org/en/indicators/137506#"

html_table = requests.get(url)

soup = BeautifulSoup(html_table.content, "html.parser")
# print(soup.prettify()) # print the parsed data of html to test it!

hdi_table = soup.find("div", attrs={"id": "indcontent"})
print(hdi_table)

to check whether the div has any content, it returns only the empty container:

<div id="indcontent">
</div>

hdi_table = soup.find("table", attrs={"id": "table"})
rows = hdi_table.find_all("tr")  # hdi_table is already the <table>, so no extra .table lookup

I also tried the two lines above to get the rows inside the table, but all I get is a NoneType error. After this step I wanted to include:

import csv  # needed for csv.writer further down

headers = rows[0]
header_text = []

for th in headers.find_all('th'):
    header_text.append(th.text)

row_text_array = []

for row in rows[1:]:
    row_text = []

    for row_element in row.find_all(['th', 'td']):
        row_text.append(row_element.text.replace('\n', '').strip())

    row_text_array.append(row_text)

# newline="" stops the csv module from writing blank lines between rows on Windows
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)

    for row_text_single in row_text_array:
        wr.writerow(row_text_single)

I would really appreciate the help! I am trying to put the code together as a whole to get the table into CSV format. I have also tried the XPath //*[@id="indcontent"], but couldn't get it to work.

score 0 · Answer 1 · answered Jun 17 '20 at 16:07

It seems that the data on the Human Development Index (HDI) page is loaded with JavaScript, which is why requests only receives the empty indcontent div. To parse the rendered page you can use, for example, Selenium. More information about parsing with Selenium and BeautifulSoup is in this question.
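
A minimal sketch of that approach, assuming Chrome with a matching driver, Selenium and BeautifulSoup are installed. The function names (fetch_rendered_html, table_rows, drop_empty_columns) are made up for illustration, and the wait condition is a guess based on the markup in the question, not something the site documents:

```python
def fetch_rendered_html(url, wait_seconds=20):
    """Open the page in headless Chrome and return its HTML after the table has loaded."""
    # Imported here so the parsing helpers below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the table counts as loaded once a body row exists inside #indcontent.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#indcontent tbody tr"))
        )
        return driver.page_source
    finally:
        driver.quit()


def table_rows(html):
    """Return the text of every cell of <table id="table">, one list per row."""
    from bs4 import BeautifulSoup  # third-party; same parser setup as in the question

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": "table"})
    if table is None:
        raise ValueError('no <table id="table"> in the HTML; was the page rendered?')
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


def drop_empty_columns(rows):
    """Remove columns that are empty in every row (the blank cell after each year)."""
    width = max((len(row) for row in rows), default=0)
    keep = [i for i in range(width) if any(i < len(row) and row[i] for row in rows)]
    return [[row[i] for i in keep if i < len(row)] for row in rows]


# Example (starts a browser and hits the network, so it is left commented out):
# html = fetch_rendered_html("http://hdr.undp.org/en/indicators/137506")
# rows = drop_empty_columns(table_rows(html))
```

The resulting rows can be passed to csv.writer exactly as in the question. Waiting for a body row rather than sleeping for a fixed time is a design choice; if the page builds the table differently, the selector would need adjusting.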

If you only need the data once, you can click the "Download Data" button on the page and use the .csv file it generates.

score 0 · Answer 2 · answered Jun 17 '20 at 16:18

The data in the table is loaded with JavaScript. However, I generally try to avoid dealing with JavaScript when scraping data and stick to requests and BeautifulSoup only. Often, the data you want is loaded from some internal API or address that you can access directly yourself. In your case, it seems to come from a few sources:

http://hdr.undp.org/sites/all/themes/hdr_theme/js/bars.json
http://hdr.undp.org/sites/all/themes/hdr_theme/js/footnotes.json
http://hdr.undp.org/sites/all/themes/hdr_theme/js/rankiso.json
http://hdr.undp.org/sites/all/themes/hdr_theme/js/aggregates.json
http://hdr.undp.org/sites/all/themes/hdr_theme/js/summary.json

You can pull these with requests and load them into Python using json. The easiest way to find these links is to open the browser developer tools (F12), go to the Network tab, reload the page, and look at which files are being transferred, especially ones with extensions like json or csv.
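
As a rough sketch of that idea: the URLs are the ones listed above, but the layout of each JSON file is not described here, so the code below only downloads a file and summarizes its top-level structure instead of assuming any particular keys. The names BASE, FILES, fetch_json and describe are illustrative, not part of any library.

```python
import json

BASE = "http://hdr.undp.org/sites/all/themes/hdr_theme/js/"
FILES = ["bars.json", "footnotes.json", "rankiso.json", "aggregates.json", "summary.json"]


def fetch_json(url, timeout=30):
    """Download one of the JSON files and parse it into Python objects."""
    import requests  # third-party; imported here so describe() works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 404 and similar errors
    return json.loads(response.text)


def describe(data, max_keys=5):
    """Return a one-line summary of the top-level shape of parsed JSON."""
    if isinstance(data, dict):
        return f"dict with {len(data)} keys, e.g. {list(data)[:max_keys]}"
    if isinstance(data, list):
        return f"list with {len(data)} items"
    return type(data).__name__


# Example (performs network requests, so it is left commented out):
# for name in FILES:
#     print(name, describe(fetch_json(BASE + name)))
```

Once you know which file holds the HDI values and how it is keyed, you can map it to rows and write them out with csv.writer.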
