
I'm trying to do some web scraping: I want to write a function that returns the population for each country. I am scraping the US Census Bureau site, but I can't get the right information back.

https://www.census.gov/popclock/world/af

<div id="basic-facts" class="data-cell">
<div class="data-container">
   <div class="data-cell" style="background-image: url.....">
      <p>population</p>
      <h2 data-population="">35.8M</h2>

This is basically what the markup I'm trying to scrape looks like. What I want is that "35.8M".

I have tried a few methods, and all I can get back is the heading element itself with its "data-population" attribute, but none of the data.

Someone mentioned to me that the website might serve the data in a format that can't be scraped. In my experience, when scraping is blocked the markup looks different: the value is in an image, a dynamic item, or something else that makes it harder to scrape. Does anyone have any thoughts on this?

# -*- coding: utf-8 -*-

# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup

### Country naming issue: in the URLs on the website the countries are coded with
### a two-letter code, e.g. "au" = australia, "in" = india. If we were to search by
### country name we would need something that relates
### the country name to the two-letter code so it can be looked up.

country = 'albania'
# Dictionary mapping each country name to its two-letter code
countrycodes = {'albania': 'al', 'afghanistan': 'af'}
### this would take long to write
### it all out, maybe it's possible to scrape these names?
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + countrycodes[country]
# Send request to retrieve the web-page using the 
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)

# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# "html.parser" is used since our page is in HTML format

soup=BeautifulSoup(page.content,"html.parser")
################################################################## Start of the part I'm not sure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object

# 1. First method (originally written with class=, which raised errors because
# class is a reserved word in Python; BeautifulSoup takes class_ instead)
population = soup.find(id="basic-facts", class_="data-cell")

# 2. Second method
matches = soup.find_all("h2", {"data-population": ""})
for match in matches:
    print(match)

# this returns the headings for the table but no data

### here we need to take out the population data
### it is listed as '<h2 data-population="">35.8M</h2>'
################################################################## end
# Extract text from the first matching element using .text
# (find_all() returns a list of tags, so pick one before reading .text)
population = matches[0].text if matches else ''

#Final Output
#Return Scraped info

print('The population of ' + country + ' is ' + population)

I marked the part of the code I'm unsure about with #######. I tried a few methods and listed two of them.

I am pretty new to coding in general, so excuse me if I didn't describe this well. Thanks for any advice anyone can give.

Ross Jacobs
  • you can always first check the .text of what you read and check if you correctly got the actual page... – B. Go Oct 04 '19 at 19:20

1 Answer

The value is retrieved dynamically from an API call, which you can find in the browser's network tab. A browser would make this call for you, but since you are not using one, you need to make the request directly yourself.

import requests

# Call the API endpoint that the page itself requests (visible in the browser's network tab)
r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()

# r[0] holds the column names and r[1] the matching values; pair them up
data = list(zip(r[0],r[1]))
# The first pair is POP; convert the raw count to millions, rounded to one decimal place
print(round(int(data[0][1])/1_000_000,1))
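
Since the question asks for a function that returns the population for any country, the request above could be wrapped along these lines. This is only a sketch under a few assumptions: `COUNTRY_CODES`, `build_params`, `parse_population`, and `population_millions` are names invented here, the lookup table holds only the two codes mentioned in the question, and it is assumed (not verified) that the endpoint accepts `get=POP` on its own. It uses the standard library's `urllib` so it runs without extra installs; `requests.get(API_URL, params=...)` would do the same job.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.census.gov/popclock/apiData_pop.php"

# Hypothetical lookup table: only the two codes that appear in the question.
COUNTRY_CODES = {"afghanistan": "af", "albania": "al"}


def build_params(code, year=2019):
    """Build the query parameters for one country.

    Assumption: asking for the POP column alone is accepted by the endpoint.
    """
    return {"get": "POP", "key": "", "YR": year, "FIPS": code}


def parse_population(rows):
    """Turn a [column_names, values] response into a population in millions."""
    record = dict(zip(rows[0], rows[1]))
    return round(int(record["POP"]) / 1_000_000, 1)


def population_millions(country, year=2019):
    """Look up the country code, call the API, and parse the result.

    This function makes a network call, so it is defined here but not run.
    """
    code = COUNTRY_CODES[country.lower()]
    query = urlencode(build_params(code, year))
    with urlopen(f"{API_URL}?{query}", timeout=10) as response:
        rows = json.load(response)
    return parse_population(rows)
```

Keeping the parsing separate from the request makes it easy to check without hitting the site: with a made-up response such as `[["POP"], ["35780458"]]`, `parse_population` returns `35.8`.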
QHarr
  • How did you know that it was dynamically retrieved from an API? And how did you know which request to take from the network tab? I wasted 20 minutes trying to figure out why the h2 tag came back empty. – DarkLeader Oct 05 '19 at 01:39
  • @DarkLeader Either switch off JavaScript in the browser (or keep a profile where JS is disabled) and compare against a JS-enabled browser when loading the page - a lot of content is not present when JS is disabled. To find the right call, use Ctrl + F in the network tab to search for a phrase or number that you hope occurs only in the data you are interested in. See [1](https://stackoverflow.com/a/56279841/6241235) and [2](https://stackoverflow.com/a/56924071/6241235). Bear in mind that, with numbers in particular, the JS may round the value shown on the page, whereas in the source the number is in a different format. – QHarr Oct 05 '19 at 03:54
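
The check described in the comments can also be done from Python, without a browser: look at the HTML exactly as a plain HTTP client receives it and see whether the target element has any text. The sketch below uses only the standard library's `html.parser`; `AttrTextCollector` and `static_texts` are names made up for this illustration, and the two HTML strings are hypothetical stand-ins for the raw response and the browser-rendered page. With BeautifulSoup, the same check is just reading `.text` on the element returned by `find()`.

```python
from html.parser import HTMLParser


class AttrTextCollector(HTMLParser):
    """Collect the text inside every <tag> that carries a given attribute."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == self.tag and self.attr in dict(attrs):
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


def static_texts(html, tag, attr):
    """Return the stripped text of each matching element in the given HTML."""
    parser = AttrTextCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Hypothetical stand-ins: the HTML a plain HTTP client might receive,
# and the same fragment after a browser has run the page's JavaScript.
raw_html = '<p>population</p><h2 data-population=""></h2>'
rendered_html = '<p>population</p><h2 data-population="">35.8M</h2>'

print(static_texts(raw_html, "h2", "data-population"))       # ['']
print(static_texts(rendered_html, "h2", "data-population"))  # ['35.8M']
```

An element that exists but comes back empty, as in the first call, is a strong hint that the value is filled in later by a script, which is the cue to look for the underlying request in the network tab.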