I am trying to scrape the data from this table: http://www.worldlifeexpectancy.com/cause-of-death/alzheimers-dementia/by-country/ The element I'm trying to find is the name of the country, in this case Finland:

<table cellspacing="0" align="center" class="hc_tbl">
<tbody>
<tr>
<td class="hc_name" style="background-color: transparent;">Finland</td>

Here is the code I'm using:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.worldlifeexpectancy.com/cause-of-death/alzheimers-dementia/by-country/')

soup = BeautifulSoup(res.content, 'html5lib')

table = soup.find('table', {'class': 'hc_tbl'})

for row in table.find('tbody').find_all('tr'):
    name = row.find('td', {'class':'hc_name'}).text.strip()
    print (name)

However, this gives an error that says 'NoneType' object has no attribute 'find', so it seems the table element is being returned as None.

I've read some other posts that seem to have a similar problem, but none of the fixes have worked in this case.

Any ideas are greatly appreciated

Thank you

SAtt

2 Answers

Examining the HTML returned by the request shows that the table is not in it: the page builds the table dynamically with a script. Thus, it is best to use a browser automation tool such as selenium:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

# Let a real browser run the page's scripts so the table gets built
driver = webdriver.Chrome()
driver.get('http://www.worldlifeexpectancy.com/cause-of-death/alzheimers-dementia/by-country/')
# Collect the non-empty country names from the rendered page
cells = soup(driver.page_source, 'lxml').find_all('td', {'class': 'hc_name'})
countries = [i.text for i in cells if i.text]
driver.quit()

Output:

[u'Finland', u'Djibouti', u'North Korea', u'United States', u'Gabon', u'Venezuela', u'Canada', u'Estonia', u'Zambia', u'Iceland', u'Guyana', u'Russia', u'Sweden', u'Senegal', u'Burundi', u'Switzerland', u'Jordan', u'Eritrea', u'Norway', u'Mali', u'Central Africa', u'Denmark', u'Namibia', u'DR Congo', u'Netherlands', u'Romania', u'Somalia', u'Belgium', u'Moldova', u'Pakistan', u'Spain', u'Bahrain', u'Bolivia', u'Australia', u'Panama', u'Tunisia', u'France', u'Ghana', u'Bhutan', u'United Kingdom', u'Mexico', u'Syria', u'Cuba', u'Sierra Leone', u'Turkey', u'Chile', u'Mauritania', u'Nicaragua', u'Uruguay', u'Tanzania', u'Egypt', u'Israel', u'Sri Lanka', u'Madagascar', u'New Zealand', u'Poland', u'Bosnia/Herzeg.', u'Ireland', u'Benin', u'Lebanon', u'Italy', u'Mozambique', u'Ethiopia', u'Hungary', u'Belize', u'Nepal', u'Malta', u'Nigeria', u'Guatemala', u'Luxembourg', u'Montenegro', u'Ukraine', u'Germany', u'Angola', u'Paraguay', u'Brazil', u'Gambia', u'Colombia', u'South Korea', u'Uganda', u'Bangladesh', u'Cyprus', u'New Guinea', u'Saudi Arabia', u'Costa Rica', u'Slovakia', u'Philippines', u'Iran', u'Guinea-Bissau', u'Indonesia', u'South Africa', u'Burkina Faso', u'Slovenia', u'Austria', u'Cote d Ivoire', u'Honduras', u'Serbia', u'Chad', u'Armenia', u'Trinidad/Tob.', u'Morocco', u'Peru', u'Bahamas', u'Comoros', u'Thailand', u'Maldives', u'Guinea', u'El Salvador', u'Portugal', u'Kenya', u'Yemen', u'Latvia', u'Greece', u'Myanmar', u'Czech Republic', u'Zimbabwe', u'Bulgaria', u'Argentina', u'Viet Nam', u'Turkmenistan', u'Qatar', u'Belarus', u'Malaysia', u'Solomon Isl.', u'Kazakhstan', u'Macedonia', u'Croatia', u'Rwanda', u'Laos', u'Swaziland', u'Niger', u'Mongolia', u'Arab Emirates', u'Togo', u'Timor-Leste', u'Fiji', u'Dominican Rep.', u'Afghanistan', u'Haiti', u'South Sudan', u'Kuwait', u'Equ. Guinea', u'Malawi', u'Azerbaijan', u'Cape Verde', u'Ecuador', u'India', u'Lesotho', u'Brunei', u'Cambodia', u'Jamaica', u'Congo', u'Tajikistan', u'Botswana', u'Albania', u'Kyrgyzstan', u'China', u'Sudan', u'Uzbekistan', u'Barbados', u'Oman', u'Georgia', u'Iraq', u'Mauritius', u'Singapore', u'Lithuania', u'Algeria', u'Suriname', u'Cameroon', u'Liberia', u'Japan', u'Libya']
Ajax1234
  • How did you figure this out? Thank you! – SAtt Mar 09 '18 at 01:07
  • @SAtt Glad to help! I inspected the HTML returned by printing `requests.get('http://www.worldlifeexpectancy.com/cause-of-death/alzheimers-dementia/by-country/').text` and compared it to the source displayed in my browser. Your desired table was missing; however, a script for updating the DOM with the table values, including an `hc_class`, was present. In that case, a browser manipulation tool is necessary to trigger the script. I hope that answers your question! – Ajax1234 Mar 09 '18 at 01:17
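
The comparison described in the comment above can be sketched with the standard library alone. This is only an illustration: `TableFinder`, `has_table`, and the two HTML strings are hypothetical stand-ins, not the real page. It parses the markup rather than searching for the class name as a substring, because a script that merely mentions the class would otherwise produce a false match.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Records whether a <table> carrying the given class appears in the markup."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # The class attribute may be missing or valueless, hence the `or ""`
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.found = True


def has_table(html, class_name="hc_tbl"):
    finder = TableFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical stand-in for what requests returns: a script, but no table element
raw_html = "<html><body><script>var cls = 'hc_tbl';</script><div id='chart'></div></body></html>"
# Hypothetical stand-in for what the browser shows after the script has run
rendered_html = '<html><body><table class="hc_tbl"><tr><td class="hc_name">Finland</td></tr></table></body></html>'

print(has_table(raw_html))
print(has_table(rendered_html))
```

If the check is false for `requests.get(...).text` but the table is visible in the browser, the table is most likely being built by a script after the page loads.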

The table is not available in the page source. It is loaded dynamically with an AJAX request. If you look in the Network tab under the Developer tools, the AJAX request is being made to this url - http://www.worldlifeexpectancy.com/j/country-cause?cause=95&order=hight.

You can see that the data is available in JSON format. You can scrape it using only the requests module, with the help of the response object's .json() method.

You can get all the data, such as rank, country, and rate, from this JSON.

import requests

r = requests.get('http://www.worldlifeexpectancy.com/j/country-cause?cause=95&order=hight')
data = r.json()

for row in data['chart']['countries']['countryitem']:
    id_ = row['id']
    country = row['name']
    rank = row['rank']
    value = row['value']
    print(rank, id_, country, value)

Partial Output:

1 FI Finland 53.77
2 US United States 45.58
3 CA Canada 35.50
4 IS Iceland 34.08
5 SE Sweden 32.41
6 CH Switzerland 32.25
...
...
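
A slightly more defensive version of the loop above might look like the following sketch. The `extract_rows` helper and the `sample` payload are hypothetical; the payload only mirrors the keys used in the code above, and storing rank and value as strings is an assumption, not something confirmed about the real endpoint.

```python
def extract_rows(data):
    """Return (rank, id, name, value) tuples, or [] if the expected keys are missing."""
    try:
        items = data["chart"]["countries"]["countryitem"]
    except (KeyError, TypeError):
        return []
    rows = []
    for item in items:
        # Convert to numbers so the rows can be sorted or compared later
        rows.append((int(item["rank"]), item["id"], item["name"], float(item["value"])))
    return rows


# Hypothetical payload shaped like the structure accessed above (no network needed)
sample = {"chart": {"countries": {"countryitem": [
    {"id": "FI", "name": "Finland", "rank": "1", "value": "53.77"},
    {"id": "US", "name": "United States", "rank": "2", "value": "45.58"},
]}}}

rows = extract_rows(sample)
for row in rows:
    print(*row)
```

With a live response, `extract_rows(r.json())` would take the place of the sample call, and an empty list would signal that the layout of the JSON has changed.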

Also, keep in mind that the <tbody> element is often absent from the page source; the browser inserts it when it is missing. So, while scraping a table, don't rely on tbody in a find() call; search for the tr elements directly from the table. See Why do browsers insert tbody element into table elements?
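
As a small illustration of the tbody point, here is a sketch on a hypothetical snippet of markup written without tbody, as it might appear in a raw page source. It assumes the built-in 'html.parser' backend, which does not add the missing element.

```python
from bs4 import BeautifulSoup

# Hypothetical raw markup: no <tbody>, as is common in a page source
html = '<table class="hc_tbl"><tr><td class="hc_name">Finland</td></tr></table>'

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "hc_tbl"})

# Going through tbody fails here: find() returns None
print(table.find("tbody"))

# Searching from the table itself works with or without tbody
names = [td.get_text(strip=True) for td in table.find_all("td", {"class": "hc_name"})]
print(names)
```

Chaining another find() onto the None result is what raises the 'NoneType' object has no attribute error.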

Keyur Potdar