
I am trying to scrape data related to COVID-19. I was able to download some data from the website (for example, the total number of cases), but not the data behind the interactive graphs.

I usually scrape interactive graphs by finding the JSON source in the 'Network' tab of the browser's developer tools. However, I could not find any network request that serves the data for these graphs.

Can someone please help me scrape data from the "Total Deaths" graph, or any other graph on the website? Thanks.

Just to make it clear: I don't want to scrape data from the table of countries; I have already done that. What I want is the data behind the graphs, for example the death ratio vs. date graph or the active cases vs. date graph.


import requests
import urllib.request
import time
import json
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

For example, the number of affected countries:

len(soup.find_all('table',{'id':'main_table_countries'})[0].find('tbody').find_all('tr'))


petezurich
  • You should only scrape data if you cannot easily find the data source. Here is a data repository that is curated daily: https://github.com/CSSEGISandData/COVID-19 This might be more useful to you than scraping a webpage. – x85ms16 Mar 06 '20 at 02:20
  • In the HTML you can find a few scripts with `Highcharts.chart(....)`, and the data is there. The page sends it directly in the HTML. – furas Mar 06 '20 at 02:23
  • BTW: it seems this data is also available as a table on the subpage https://www.worldometers.info/coronavirus/coronavirus-death-toll/ – furas Mar 06 '20 at 10:04
  • Does this answer your question? [Can I scrape the raw data from highcharts.js?](https://stackoverflow.com/questions/39305877/can-i-scrape-the-raw-data-from-highcharts-js) – user4157124 Mar 06 '20 at 11:42
  • 1
    @user4157124 yesterday Ayush Garg put the same link in comment below his answer. So today I tested it and it works. I putted it in my answer :) – furas Mar 06 '20 at 11:50

3 Answers


As I mentioned in a comment, there are a few scripts with Highcharts.chart(....) in the HTML, so I tried to get the values using several different methods.

Most of them require finding the elements in the data manually and building the correct indexes or XPath to reach a value, so it is not that easy.

The easiest was js2xml, which I saw in @AyushGarg's link Can I scrape the raw data from highcharts.js?

The hardest was pyjsparser.

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import pyjsparser
import js2xml

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

scripts = soup.find_all('script')

script = scripts[24].text  # index found manually; it changes whenever the page changes
print(script)

print('\n--- pyjsparser ---\n')
data = pyjsparser.parse(script) 
data = data['body'][0]['expression']['arguments'][1]['properties'][-2]['value']['elements'][0]['properties'][-1]['value']['elements'] # a lot of work to find it
#print(json.dumps(data, indent=2))
data = [x['value'] for x in data]
print(data)
# text values
# it needs work

print('\n--- eval ---\n')
data = script.split('data: [', 1)[1].split(']', 1)[0]
data = eval(data)  # it creates tuple
print(data)
# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

print('\n--- json ---\n')
data = script.split('data: [', 1)[1].split(']', 1)[0]
data = '[' + data + ']' # create correct JSON data
data = json.loads(data) # this time doesn't need `dirtyjson`
print(data)
# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

print('\n--- js2xml ---\n')
data = js2xml.parse(script)
print(data.xpath('//property[@name="data"]//number/@value')) # nice and short xpath
# text values
#print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
text = data.xpath('//property[@name="title"]//string/text()')
print(text[0])
print(text[1])

Result

 Highcharts.chart('coronavirus-deaths-linear', { chart: { type: 'line' }, title: { text: 'Total Deaths' }, subtitle: { text: '(Linear Scale)' }, xAxis: { categories: ["Jan 22","Jan 23","Jan 24","Jan 25","Jan 26","Jan 27","Jan 28","Jan 29","Jan 30","Jan 31","Feb 01","Feb 02","Feb 03","Feb 04","Feb 05","Feb 06","Feb 07","Feb 08","Feb 09","Feb 10","Feb 11","Feb 12","Feb 13","Feb 14","Feb 15","Feb 16","Feb 17","Feb 18","Feb 19","Feb 20","Feb 21","Feb 22","Feb 23","Feb 24","Feb 25","Feb 26","Feb 27","Feb 28","Feb 29","Mar 01","Mar 02","Mar 03","Mar 04","Mar 05"] }, yAxis: { title: { text: 'Total Coronavirus Deaths' } }, legend: { layout: 'vertical', align: 'right', verticalAlign: 'middle' }, credits: { enabled: false }, series: [{ name: 'Deaths', color: '#FF9900', lineWidth: 5, data: [17,25,41,56,80,106,132,170,213,259,304,362,426,492,565,638,724,813,910,1018,1115,1261,1383,1526,1669,1775,1873,2009,2126,2247,2360,2460,2618,2699,2763,2800,2858,2923,2977,3050,3117,3202,3285,3387] }], responsive: { rules: [{ condition: { maxWidth: 800 }, chartOptions: { legend: { layout: 'horizontal', align: 'center', verticalAlign: 'bottom' } } }] } }); 

--- pyjsparser ---

[17.0, 25.0, 41.0, 56.0, 80.0, 106.0, 132.0, 170.0, 213.0, 259.0, 304.0, 362.0, 426.0, 492.0, 565.0, 638.0, 724.0, 813.0, 910.0, 1018.0, 1115.0, 1261.0, 1383.0, 1526.0, 1669.0, 1775.0, 1873.0, 2009.0, 2126.0, 2247.0, 2360.0, 2460.0, 2618.0, 2699.0, 2763.0, 2800.0, 2858.0, 2923.0, 2977.0, 3050.0, 3117.0, 3202.0, 3285.0, 3387.0]

--- eval ---

(17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387)
Total Deaths
Total Coronavirus Deaths

--- json ---

[17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387]
Total Deaths
Total Coronavirus Deaths

--- js2xml ---

['17', '25', '41', '56', '80', '106', '132', '170', '213', '259', '304', '362', '426', '492', '565', '638', '724', '813', '910', '1018', '1115', '1261', '1383', '1526', '1669', '1775', '1873', '2009', '2126', '2247', '2360', '2460', '2618', '2699', '2763', '2800', '2858', '2923', '2977', '3050', '3117', '3202', '3285', '3387']
Total Deaths
Total Coronavirus Deaths
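The same extraction can also be reproduced offline, without any third-party parser. Assuming a `Highcharts.chart(....)` call shaped like the one printed above (the snippet below is a trimmed, hand-made sample, not the live page), the `xAxis` categories and the `series` data are both valid JSON arrays, so they can be cut out and paired into rows with plain `json`:

```python
import json

# trimmed, hypothetical sample in the same shape as the Highcharts.chart(....) call above
script = """Highcharts.chart('coronavirus-deaths-linear', {
    title: { text: 'Total Deaths' },
    xAxis: { categories: ["Jan 22","Jan 23","Jan 24"] },
    series: [{ name: 'Deaths', data: [17,25,41] }]
});"""

# both arrays are valid JSON once wrapped back in brackets
categories = json.loads('[' + script.split('categories: [', 1)[1].split(']', 1)[0] + ']')
values = json.loads('[' + script.split('data: [', 1)[1].split(']', 1)[0] + ']')

rows = list(zip(categories, values))
print(rows)  # [('Jan 22', 17), ('Jan 23', 25), ('Jan 24', 41)]
```

Pairing the dates with the values this way gives you exactly the "deaths vs. date" series the question asks for, ready to feed into pandas or matplotlib.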

EDIT: The page structure changed, so the code needed some changes too.

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import js2xml
import pyjsparser

# --- functions ---

def test_eval(script):
    print('\n--- eval ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    values = eval(text)  # it creates tuple
    print(values)

    # titles
    # the second split starts at `yAxis` because there is another `title` without text
    # the beginning is split in a few steps because the text may be indented differently (different number of spaces)
    # (a regex could do it in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_json(script):
    print('\n--- json ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    text = '[' + text + ']' # create correct JSON data
    values = json.loads(text) # this time doesn't need `dirtyjson`
    print(values)

    # titles
    # the second split starts at `yAxis` because there is another `title` without text
    # the beginning is split in a few steps because the text may be indented differently (different number of spaces)
    # (a regex could do it in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_js2xml(script):
    print('\n--- js2xml ---\n')

    data = js2xml.parse(script)

    # chart values (short and nice path)
    values = data.xpath('//property[@name="data"]//number/@value')
    #values = [int(x) for x in values] # it may need to convert to int() or float()
    #values = [float(x) for x in values] # it may need to convert to int() or float()
    print(values)

    # title (short and nice path)
    #print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
    #title = data.xpath('//property[@name="title"]//string/text()')
    #print(js2xml.pretty_print(data.xpath('//property[@name="yAxis"]//property[@name="title"]')[0]))

    title = data.xpath('//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

    title = data.xpath('//property[@name="yAxis"]//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

def test_pyjsparser(script):
    print('\n--- pyjsparser ---\n')

    data = pyjsparser.parse(script)

    print("body's number:", len(data['body']))

    for number, body in enumerate(data['body']):
        if (body['type'] == 'ExpressionStatement'
            and body['expression']['callee']['object']['name'] == 'Highcharts'
            and len(body['expression']['arguments']) > 1):

            arguments = body['expression']['arguments']
            #print(json.dumps(values, indent=2))
            for properties in arguments[1]['properties']:
                #print('name: >{}<'.format(p['key']['name']))
                if properties['key']['name'] == 'series':
                    values = properties['value']['elements'][0]
                    values = values['properties'][-1]
                    values = values['value']['elements'] # a lot of work to find it
                    #print(json.dumps(values, indent=2))

                    values = [x['value'] for x in values]
                    print(values)

    # title (very complicated path) 
    # It needs more work to find correct indexes to get title
    # so I skip this part as too complex.

# --- main ---

url = 'https://www.worldometers.info/coronavirus/#countries'

r = requests.get(url)
#print(r.text)
soup = BeautifulSoup(r.text, "html.parser")

all_scripts = soup.find_all('script')
print('number of scripts:', len(all_scripts))

for number, script in enumerate(all_scripts):

    #if 'data: [' in script.text:
    if 'Highcharts.chart' in script.text:
        print('\n=== script:', number, '===\n')
        test_eval(script.text)
        test_json(script.text)
        test_js2xml(script.text)
        test_pyjsparser(script.text)
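The comments in `test_eval` and `test_json` note that the multi-step splits could be a single regex. A sketch of that idea, assuming the same `title: { ... text: '...' }` nesting (the sample string below is hypothetical):

```python
import re

# hypothetical sample with the same nesting as the page's chart config
script = """Highcharts.chart('x', {
    title: {
        text: 'Total Deaths'
    },
    yAxis: {
        title: {
            text: 'Total Coronavirus Deaths'
        }
    }
});"""

# \s* absorbs any newlines and indentation between the tokens
chart_title = re.search(r"title:\s*{\s*text:\s*'([^']*)'", script).group(1)

# re.DOTALL lets the non-greedy .*? cross newlines to reach the nested title
yaxis_title = re.search(r"yAxis:\s*{.*?title:\s*{\s*text:\s*'([^']*)'",
                        script, re.DOTALL).group(1)

print(chart_title)  # Total Deaths
print(yaxis_title)  # Total Coronavirus Deaths
```

The non-greedy quantifier matters: a greedy `.*` after `yAxis` could run past the title you want if the script contained more than one.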

I took this from my blog post: Scraping: How to get data from an interactive plot created with HighCharts

furas

I used furas' code and made a version that grabs data from various countries.

import requests
from bs4 import BeautifulSoup
import json
import js2xml

WANTED = ['Total Cases', 'Active Cases', 'Total Deaths']

def scrape(country):
    url = 'https://www.worldometers.info/coronavirus/country/' + country + '/'
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    retlist = [country]
    for script in soup.find_all('script'):
        if not script.string:  # external scripts have no inline content
            continue
        try:
            parsed = js2xml.parse(script.string)  # parse each script only once
            title = parsed.xpath('//property[@name="title"]//string/text()')[0]
            if title in WANTED:
                retlist.append(title)
                retlist += json.loads('[' + script.string.split('data: [', 1)[1].split(']', 1)[0] + ']')
        except Exception:  # scripts without a chart fail to parse or index
            pass
    return retlist

countries = ['us', 'spain', 'italy', 'france', 'germany', 'iran', 'china', 'uk', 'south-korea']
for country in countries:
    print(scrape(country))
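The flat list returned by `scrape` interleaves chart titles with their values. A small sketch (using a hypothetical hand-made list in the same shape) that regroups it into a dict keyed by chart title:

```python
def regroup(retlist):
    """Turn [country, title, v1, v2, ..., title, v1, ...] into (country, {title: [values]})."""
    country, rest = retlist[0], retlist[1:]
    charts = {}
    current = None
    for item in rest:
        if isinstance(item, str):  # a string is a chart title, starting a new group
            current = item
            charts[current] = []
        else:                      # numbers belong to the most recent title
            charts[current].append(item)
    return country, charts

# hypothetical example in the same shape as scrape()'s return value
country, charts = regroup(['us', 'Total Cases', 1, 5, 'Total Deaths', 0, 2])
print(country, charts)  # us {'Total Cases': [1, 5], 'Total Deaths': [0, 2]}
```

This keeps the scraper itself unchanged while making the per-chart series easy to look up by name.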
Richard Sandoz

Here is my take on this:

First, you have to get the table as a NumPy array:

import requests
from bs4 import BeautifulSoup
import numpy as np

def convertDigit(string):
    if string.replace(",", "").isdigit():
        return int(string.replace(",", ""))
    return string

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser") # Parse html

table = soup.find("table", {"id": "main_table_countries"}).find_all("tbody") # table
tr_elems = table[0].find_all("tr") # All rows in table

data = []
for tr in tr_elems: # Loop through rows
    td_elems = tr.find_all("td") # Each column in row
    data.append([convertDigit(td.text.strip()) for td in td_elems])

np_array = np.array(data)

Now, all your data is inside np_array. After this, it should be pretty simple to convert your numpy array into a graph.
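The `convertDigit` helper is what lets the table cells become usable numbers: it strips the thousands separators that Worldometers uses and leaves non-numeric cells (country names, empty strings, signed deltas) untouched. Standalone:

```python
def convertDigit(string):
    # same helper as above: strip thousands separators, convert pure digit strings
    if string.replace(",", "").isdigit():
        return int(string.replace(",", ""))
    return string

print(convertDigit("1,234"))   # 1234
print(convertDigit("USA"))     # USA
print(convertDigit("+3,387"))  # +3,387  (the sign fails isdigit(), so it stays a string)
```

Note that cells like "+3,387" stay strings; if you need those as numbers too, the helper would have to handle a leading sign explicitly.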

marc_s
Ayush Garg
  • Thanks for your answer, but I don't want to scrape data from the table. What I want to do is get data from the graph. For example, data from death ratio graph vs date. – Mohammed Albassam Mar 06 '20 at 03:58
  • Okay, so poking around in the code, I saw that the graphs were generated using this module named `highcharts`. I searched that up on google and found [this](https://stackoverflow.com/questions/39305877/can-i-scrape-the-raw-data-from-highcharts-js). Hopefully this helps you! – Ayush Garg Mar 06 '20 at 04:10
  • I used information from your link to create the example in my answer, thanks. And later I found the same data as a table on the subpage https://www.worldometers.info/coronavirus/coronavirus-death-toll/ . It may be useful for your answer. – furas Mar 06 '20 at 10:07
  • @furas We need country-wise daily data for variables like Deaths, Confirmed, Active, Recovered – maliks Apr 09 '20 at 10:10
  • @maliks you have the table with countries on https://www.worldometers.info/coronavirus/#countries - if you read this table every day, you will have daily data. Or go to the GitHub repo with daily .csv files: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports – furas Apr 09 '20 at 10:22
  • @furas but this GitHub link belongs to Johns Hopkins as a data source, I guess, not Worldometers. I need Worldometers data like on this https://www.worldometers.info/coronavirus/country/switzerland/ page; there are some 4-5 charts, and I need the data behind these charts, as it is not available in table form on the Worldometers website or anywhere else. If it is in tabular or CSV form, please share it with me. – maliks Apr 10 '20 at 17:57
  • @maliks why waste time getting data from a chart if you have CSV files on GitHub? And if you really need to get it from Worldometers, then you will have to get the data from `Highcharts.chart(....)` (like in my answer), because there is no other data. – furas Apr 10 '20 at 18:12
  • @maliks BTW: the data on GitHub doesn't belong to a person named John Hopkins but to Johns Hopkins University. – furas Apr 10 '20 at 18:18
  • @maliks BTW: at the bottom of the GitHub page, the first source listed is "World Health Organization (WHO)". At the bottom of the Worldometers page https://www.worldometers.info/coronavirus/ the first source is also "World Health Organization (WHO)". – furas Apr 10 '20 at 18:21
  • @maliks BTW: if you need Switzerland, then on your page https://www.worldometers.info/coronavirus/country/switzerland/ there are links in "Latest Updates" to the source page https://rsalzer.github.io/COVID_19_CH/ which links to GitHub https://github.com/rsalzer/COVID_19_CH which in turn links to a GitHub repo with CSVs for the cantons of Switzerland: https://github.com/openZH/covid_19/tree/master/fallzahlen_kanton_total_csv – furas Apr 10 '20 at 18:27