
Hi, I'm trying to run the code below and I get an error: HTTPError: Forbidden. The traceback says the failing line is in the request.py file in the urllib folder. I want to extract data from a website.

This is the code I am trying to run:

import pandas as pd
import geopandas as gpd

data = pd.read_html('https://www.worldometers.info/coronavirus/')

And this is the response I get from the Spyder console:

Python 3.8.2 (default, Mar 26 2020, 15:53:00)

Type "copyright", "credits" or "license" for more information.

IPython 7.13.0 -- An enhanced Interactive Python.

runfile('/home/evans/Desktop/GIS DEVELOPMENTS/PROJECTS/Coronavirus2020.py', wdir='/home/evans/Desktop/GIS DEVELOPMENTS/PROJECTS')
Traceback (most recent call last):

  File "/home/evans/Desktop/GIS DEVELOPMENTS/PROJECTS/Coronavirus2020.py", line 5, in <module>
    data = pd.read_html('https://www.worldometers.info/coronavirus/')

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/html.py", line 1085, in read_html
    return _parse(

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/html.py", line 895, in _parse
    tables = p.parse_tables()

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/html.py", line 213, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/html.py", line 733, in _build_doc
    raise e

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/html.py", line 714, in _build_doc
    with urlopen(self.io) as f:

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/site-packages/pandas/io/common.py", line 141, in urlopen
    return urllib.request.urlopen(*args, **kwargs)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)

  File "/home/evans/anaconda3/envs/myenv/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: Forbidden

The problem at first was that lxml was missing, so I decided to install it from my environment using pip3 install lxml, but this is the return message I got:

Requirement already satisfied: lxml in /usr/lib/python3/dist-packages (4.4.1).

But that copy is not in my environment folder; it is in the base/root folder. So I used pip install lxml instead, and it worked. Then, when I ran the script, it returned the error above.

I would appreciate any guidance to help me overcome this problem.

Steel8
  • "The problem at first was that lxml was missing, so I decided to install it from my environment using pip3 install lxml, but this is the return message I got.": that was a really bad move because pip and conda packages are not binary compatible. So the best thing you can do is remove and recreate your environment and (from now on) use conda to install your packages instead of pip (unless conda doesn't have a package for it). – Carlos Cordoba Mar 29 '20 at 15:11
  • Okay, thank you so much, let me recreate it. But when I tried to install the lxml package in my environment using conda install lxml, it returned "solving environment" errors. In the base directory it was already installed, so pip3 reported the requirement as already satisfied. That's when I decided to use pip instead. Can you help me find another option that will make sure it installs in my environment? And is this the reason I get the HTTPError: Forbidden? – Steel8 Mar 29 '20 at 15:37
  • Perhaps there are problems with Python 3.8, so I'd advise you to use 3.7 instead. – Carlos Cordoba Mar 29 '20 at 15:56
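
To see which environment a package actually ended up in, it can help to ask the running interpreter directly. The following is only a minimal sketch: the helper name locate_package is made up for illustration, and it assumes the script is run from the same Spyder/conda environment that raises the error.

```python
import importlib.util
import sys


def locate_package(name):
    """Return the path this interpreter would import `name` from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # sys.executable shows which Python (base install or conda env) is running this script.
    print("Interpreter:", sys.executable)
    # A path under /usr/lib/... rather than .../envs/myenv/... would mean the base copy is being used.
    print("lxml found at:", locate_package("lxml"))
```

If the interpreter path and the package path point at different installations, running the installer through that same interpreter (for example python -m pip install lxml from the activated environment) is one way to make sure both refer to the same place.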

1 Answer

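The site is probably blocking the request. By default urllib identifies itself as a Python script in the User-Agent header, and some servers answer such requests with 403 Forbidden. See:

HTTP error 403 in Python 3 Web Scraping

So try sending a browser-like User-Agent header: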

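import pandas as pd
from urllib.request import Request, urlopen

# Send a browser-like User-Agent so the server does not reject the request
req = Request('https://www.worldometers.info/coronavirus/', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parse every <table> in the downloaded HTML; the first one is the per-country table
tables = pd.read_html(webpage)
df = tables[0]
print(df.head())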

Outputs:

 Country,Other  TotalCases NewCases  TotalDeaths NewDeaths  TotalRecovered  \
0           USA      123781     +203       2229.0        +8          3238.0   
1         Italy       92472      NaN      10023.0       NaN         12384.0   
2         Spain       78797   +5,562       6528.0      +546         14709.0   
3       Germany       58247     +552        455.0       +22          8481.0   
4          Iran       38309   +2,901       2640.0      +123         12391.0   

   ActiveCases  Serious,Critical  Tot Cases/1M pop  Deaths/1M pop 1stcase  
0       118314            2666.0             374.0            7.0  Jan 20  
1        70065            3856.0            1529.0          166.0  Jan 29  
2        57560            4165.0            1685.0          140.0  Jan 30  
3        49311            1581.0             695.0            5.0  Jan 26  
4        23278            3206.0             456.0           31.0  Feb 18  
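
If you want a clearer message when the server still refuses the request, the same idea can be wrapped in a small helper. This is only a sketch under assumptions: build_request and fetch_html are names invented here, the opener parameter exists only so the function can be tried without network access, and 'Mozilla/5.0' is just the header value used above, not something the site is known to require.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes; raise a more descriptive error on HTTP 403."""
    try:
        with opener(build_request(url)) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 403:
            # The server refused the request even with the custom header.
            raise RuntimeError(f"{url} refused the request (403 Forbidden)") from err
        raise  # Any other HTTP error is passed through unchanged.
```

The returned bytes can then be handed to pd.read_html in the same way as webpage in the snippet above.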
MDR