Trying to do some webscraping. Trying to make a function that will spit out population for each country. I am trying to web-scrape from US census bureau, but I cant get back the right information.
https://www.census.gov/popclock/world/af
<div id ="basic-facts" class = "data-cell">
<div class = "data-contianer">
<div class="data-cell" style = "background-image: url.....">
<p>population</p>
<h2 data-population="">35.8M</h2>"
this is basically what the code looks like that im trying to scrape. What I want is that "35.8M"
I have tried a few methods and all I can get back is the heading itself "data-population", none of the data.
Someone mentioned to me that maybe the website has it in some format so that it cant be scraped. In my experience, when it is blocked, the formatting looks different, it is in a image or dynamic item or something that makes it more difficult to scrape. Does anyone have any thoughts on this?
# -*- coding: utf-8 -*-
# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup
### country naming issues: In the URLS on the websites the countries are coded with
### a two digit code # "au" = australia, "in" = india. If we were to search for a
### country name or something like that we would need to have something to relate
### the country name to the two letter code so it can search for it.
country = 'albania'
countrycode = [al: 'albania', af: 'afghanistan',]
### this would take long to write
### it all out, maybe its possible to scrape these names?
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/'+countrycode
# Send request to retrieve the web-page using the
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)
# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# html_parser is used since our page is in HTML format
soup=BeautifulSoup(page.content,"html.parser")
################################################################## Start what im not sure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object
1. First method
population = soup.find(id="basic-facts", class="data-cell")
#I tried some methods like this. got only errors
2. Second method
populaiton = soup.findAll("h2", {"data-population": ""})
for i in population:
print i
# this returns the headings for the table but no data
### here we need to take out the population data
### it is listed as "<h2 data-population = "" >35.8</h2>"
################################################################## end
# Extract text from the selected BeautifulSoup object using .text
population = population.text
#Final Output
#Return Scraped info
print 'The Population of'+country+'is'+population
I outlined the code with #######. I tried a few methods. I listed two
I am pretty new to coding in general, so excuse me if I didnt describe this all right, thanks for any advice anyone can give.