2

I've tried to get the world population from this website: https://www.worldometers.info/world-population/ but I can only get the html code, not the data of the actual numbers.

I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.

# Imports
import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')

# This only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)

This is the object the information is stored in:

<span class="rts-counter" rel="current_population">retrieving data... </span>

and in 'inspect-mode' you can see this:

<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>

My code always returns only the first version, but I want the second one, the one shown in inspect mode.

Here is a picture of the inspect-mode.

Saha

3 Answers

1

The website you are scraping is a JavaScript web app. The element content you see in inspect mode is produced by JavaScript that runs after the page downloads and populates that element. Before that JavaScript runs, the element contains only the text "retrieving data...", which is what your Python code sees. Neither the Python requests library nor BeautifulSoup runs JavaScript in downloaded HTML; they only download and parse the HTML, and that is why your code sees only the initial text.
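To make the difference concrete, here is a small sketch that extracts the counter text from the two snippets quoted in the question: the markup that requests downloads, and the markup that inspect mode shows after the script has run. It uses the standard library's html.parser as a stand-in for BeautifulSoup so it runs without extra packages. The CounterText class and the counter_text helper are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Markup as downloaded by requests (copied from the question).
STATIC_HTML = '<span class="rts-counter" rel="current_population">retrieving data... </span>'

# Markup as shown in inspect mode after the page's JavaScript has run.
RENDERED_HTML = (
    '<span class="rts-counter" rel="current_population">'
    '<span class="rts-nr-sign"></span>'
    '<span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e0">630</span></span>'
)


class CounterText(HTMLParser):
    """Collects all text inside the <span> whose rel attribute matches.

    Assumes every tag inside that span has a matching end tag,
    which holds for the snippets above.
    """

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.depth = 0      # 0 means we are outside the target span
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "span" and dict(attrs).get("rel") == self.rel:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def counter_text(html, rel="current_population"):
    parser = CounterText(rel)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(repr(counter_text(STATIC_HTML)))    # the placeholder text
print(repr(counter_text(RENDERED_HTML)))  # the digits the browser shows
```

The same parser gives two different answers because the input differs, not because the parsing is wrong. Without a JavaScript engine, only the first input is ever available.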

You have two options:

  1. Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
  2. Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
hrunting
1

You will need a tool that lets JavaScript run, such as Selenium, because this number is produced by a counter that is generated in this script: https://www.realtimestatistics.net/rts/RTSp.js

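from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# Newer Selenium 4 releases removed find_element_by_css_selector; use find_element with By
print(d.find_element(By.CSS_SELECTOR, '[rel="current_population"]').text)
d.quit()  # close the browser when done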

You could try writing your own version of that JavaScript script, but I wouldn't recommend it.

I didn't need an explicit wait condition for the Selenium script, but one could be added.
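If the counter is sometimes still showing the placeholder when it is read, an explicit wait might look like the sketch below. It polls until the element's text looks like a comma-grouped number and then converts it to an int. The helper names (counter_is_loaded, parse_counter, fetch_population) and the 10-second timeout are my own choices, and the number format is assumed from the markup quoted in the question.

```python
import re

URL = "https://www.worldometers.info/world-population/"
SELECTOR = '[rel="current_population"]'


def counter_is_loaded(text):
    """True once the text is a comma-grouped number rather than the placeholder."""
    return re.fullmatch(r"\d{1,3}(,\d{3})*", text.strip()) is not None


def parse_counter(text):
    """Turn '7,703,227,630' into the integer 7703227630."""
    return int(text.strip().replace(",", ""))


def fetch_population(timeout=10):
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def loaded_text(driver):
        # Return the text once it is numeric; a falsy value makes the wait retry.
        text = driver.find_element(By.CSS_SELECTOR, SELECTOR).text
        return text if counter_is_loaded(text) else False

    driver = webdriver.Chrome()
    try:
        driver.get(URL)
        text = WebDriverWait(driver, timeout).until(loaded_text)
        return parse_counter(text)
    finally:
        driver.quit()
```

Calling fetch_population() opens Chrome, so it needs a local Chrome install. If the counter never appears within the timeout, WebDriverWait raises Selenium's TimeoutException.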

QHarr
  • Thank you! This method isn't very fast, but I don't think there's anything to improve that, right? Anyway Thank you for your answer :) – Saha May 15 '19 at 16:55
  • you are most welcome. Unless there is a dedicated API which would be faster. – QHarr May 15 '19 at 16:56
0

The value is written into the DOM by JavaScript at runtime, so Beautiful Soup, which only parses the downloaded HTML, will not work as you want it to.

You will have to use something that lets JavaScript run (e.g. a browser), so you could build your own browser using Qt4 or the like. Sentdex has a good tutorial on it here:

https://www.youtube.com/watch?v=FSH77vnOGqU

Otherwise, you could use Selenium:

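from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # crude wait so the page's JavaScript can fill in the counter
html = driver.page_source  # rendered HTML, ready to pass to BeautifulSoup
driver.quit()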
involtus