
I have a Google Sheet with coronavirus data. I want to update it using the Worldometers site. I don't want to copy the CSS selector for every single one of the thousands of cells in my code.

I tried getting the table, but the text is separated by newlines after each cell. I got the table with the following code:

    import bs4
    import requests

    res = requests.get('https://www.worldometers.info/coronavirus')
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    print(len(soup.select('table')))
    txt = soup.select('table')[1]
    print(txt.text)

Is there a way to convert what we get from the table into a format that can be put into Excel, or to put the table HTML itself into Excel so that it will be formatted properly?

Anonymous
  • If you use Power Query in Excel, and just type in that URL, it can easily download the "now" and "yesterday" tables into Excel. If you need to break down each country, it is more complicated. – Ron Rosenfeld May 02 '20 at 23:32

2 Answers


I suggest you take a look at the following python modules:

  1. pandas: for handling tabular data (official documentation, guide to convert an HTML table into Excel);
  2. openpyxl: for Excel spreadsheets (see this guide);
  3. EZSheets: for Google Spreadsheets (see this guide).

I hope these resources will help you.

Edit: the code below (based on this) should allow you to retrieve the table.

    import pandas as pd
    import requests

    url = "https://www.worldometers.info/coronavirus"
    hdr = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    # pd.read_html cannot send custom headers (see
    # https://stackoverflow.com/questions/45943179/python-pandas-typeerror-first-argument-must-be-string-or-compiled-pattern#45944194),
    # so fetch the page with requests, which can, and parse the downloaded HTML.
    res = requests.get(url, headers=hdr)
    res.raise_for_status()

    # DataFrame.to_excel creates the file itself, so no need to pre-create a workbook.
    table = pd.read_html(res.text)[0]
    table.to_excel("coronaData.xlsx")

Calling pd.read_html(url) directly keeps raising the error urllib.error.HTTPError: HTTP Error 403: Forbidden, because the site rejects urllib's default user agent; downloading the page with requests and the hdr headers first works around this. Maybe it would be easier with another website that does not block the default agent (no hdr needed; e.g. Wikipedia).
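For reference, the same header trick also works with the standard library alone: the 403 comes from urllib's default User-Agent, which can be overridden on the Request object. A minimal sketch (hdr repeated here so it is self-contained; `urlopen(req)` would perform the actual download):

```python
from urllib.request import Request

url = "https://www.worldometers.info/coronavirus"
hdr = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
}

# Attach the headers to the request object; urlopen(req) would then send
# the browser-like User-Agent instead of urllib's default Python-urllib/x.y.
req = Request(url, headers=hdr)
print(req.get_header("User-agent"))
```

Note that `Request` normalizes header names with `str.capitalize()`, which is why it is read back as "User-agent".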

westload

I think I found an answer to my question. I did some research and found out how to build a CSV file by appending each row. I can also pull in the population data using the same method. After I got the CSV file, I put it in my Google Drive and converted it to a Google Sheet. The code is shown below.

    import bs4
    import requests
    import csv

    res = requests.get('https://www.worldometers.info/coronavirus')
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    print(len(soup.select('table')))
    txt = soup.select('table')[1]
    txt2 = bs4.BeautifulSoup(
        requests.get('https://www.worldometers.info/world-population/population-by-country/').text,
        'html.parser').select('table')[0]

    output_covidrows = []
    output_populationrows = []

    for table_row2 in txt2.findAll('tr'):
        output_row2 = [column2.text.strip() for column2 in table_row2.findAll('td')]
        output_populationrows.append(output_row2)

    for table_row in txt.findAll('tr'):
        output_row = [column.text.strip() for column in table_row.findAll('td')]
        if not output_row:
            continue
        # Look up this country in the population table (the country name is
        # column 0 here but column 1 there) and append population and median age.
        for populationentry in output_populationrows:
            if len(populationentry) < 10:
                continue
            if populationentry[1] == output_row[0]:
                output_row.append(populationentry[2])
                output_row.append(populationentry[9])
                break
        if len(output_row) > 2:
            output_covidrows.append(output_row)

    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(
            ['Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths',
             'Total Recovered', 'Active Cases',
             'Serious Cases', 'Tot Cases per 1m', 'Deaths per 1m', 'Total tests',
             'Tests per 1m', 'Region', 'Population',
             'Med Age'])
        writer.writerows(output_covidrows)

    print('Done')
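One possible refinement: scanning the whole population list once per coronavirus row is quadratic. Building a dict keyed by country name first makes each lookup constant-time. A sketch with made-up sample rows standing in for the scraped tables (the real rows come from the two sites above):

```python
# Hypothetical sample rows: population rows follow the population table's
# layout ([rank, country, population, ..., median_age at index 9]) and
# covid rows start with the country name.
population_rows = [
    ['1', 'China', '1439323776', '', '', '', '', '', '', '38'],
    ['2', 'India', '1380004385', '', '', '', '', '', '', '28'],
]
covid_rows = [
    ['India', '37336', '1902'],
    ['China', '82875', '2'],
]

# Build the lookup table once: country name -> (population, median age).
population_by_country = {
    row[1]: (row[2], row[9]) for row in population_rows if len(row) >= 10
}

# Each covid row now gets its extra columns with a single dict lookup.
for row in covid_rows:
    extra = population_by_country.get(row[0])
    if extra:
        row.extend(extra)

print(covid_rows)
```

Countries missing from the population table are simply left without the extra columns, matching the original code's behavior.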
Anonymous