181

I was trying to scrape a website for practice, but I kept getting HTTP Error 403. Does it think I'm a bot?

Here is my code:

#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.findall(findlink, webpage)

print(len(row_array))

iterator = []

The error I get is:

 File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Dharman
Josh

12 Answers

351

This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, which is easily detected). Try setting a known browser user agent with:

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', 
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I think it's just a typo.

TIP: since this is an exercise, choose a different, less restrictive site. Maybe they are blocking urllib for some reason...

Jaroslav Bezděk
Stefano Sanfilippo
  • I assume it's safe to reuse `req` for multiple `urlopen` calls. – Asclepius Feb 02 '19 at 00:28
  • It might be a little late, but I already have a User-Agent in my code and it still gives me `Error 404: Access denied`. – Reema Parakh Jul 24 '19 at 04:23
  • This works, but I feel like they must have a good reason to block bots, and that I'm violating their terms of service. – xjcl Oct 11 '19 at 07:19
  • This unfortunately does not work for some sites. There's a `requests` solution at https://stackoverflow.com/questions/45086383/python-requests-403-forbidden-despite-setting-user-agent-headers though. – NelsonGon Jul 21 '21 at 10:55
  • Some sites block `'Mozilla/5.0'` as well. You may want to try `'Mozilla/6.0'` or other headers. – Qin Heyang Feb 11 '22 at 00:21 (see the sketch below)
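
As the last comment above notes, some sites also reject a bare 'Mozilla/5.0'. A minimal sketch of sending a fuller, real-browser User-Agent string instead; the exact string and URL are only examples, so copy a current one from your own browser if this one is rejected:

from urllib.request import Request, urlopen

ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36')
req = Request('https://example.com', headers={'User-Agent': ua})
html = urlopen(req).read().decode('utf-8')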
55

It's definitely blocking because you're using urllib, based on the user agent. The same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user agent with Mozilla.

import urllib.request

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
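
If you also need the page body, a minimal sketch of reading it from the opened response (assuming the content is UTF-8):

html = response.read().decode('utf-8')   # response behaves like a file object
print(html)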

Source

zeta
  • The top answer didn't work for me, while yours did. Thanks a lot! – Tarun Uday Mar 31 '16 at 19:32
  • This works just fine, but I need to attach the SSL configuration to this. How do I do that? Before, I just added it as a second parameter (`urlopen(request, context=ctx)`). – Hauke Apr 25 '17 at 17:40
  • Looks like it did open, but it says 'ValueError: read of closed file'. – MartianMartian May 11 '17 at 15:37
  • @zeta How did you manage to scrape OfferUp and provide the requisite geo coordinates to perform the search from a script? – CJ Travis Jun 21 '17 at 14:35
  • @CJTravis, I wasn't scraping OfferUp. I was just retrieving item values based on an exact URL of an item. That didn't require any geo coordinates for me. – zeta Jun 23 '17 at 08:00
  • This didn't work for me. I got a "read of a closed file" just as @Martian2049 did. Do you need to chain a .read() onto the last statement for this solution to work? – user2101068 Aug 04 '20 at 21:44
  • This works but produces the warning `DeprecationWarning: AppURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods` in Python 3.7. – kjsr7 Dec 30 '20 at 09:45
  • `__main__:1: DeprecationWarning: AppURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods` after `opener = AppURLopener()`. However, how do I use the response to get a video if the URL can download a .ts file using Chrome? – Raii May 21 '22 at 02:54
  • @zeta Literally the best answer forever. I kept trying to web-scrape, but kept getting a 403 (forbidden access error). I tried messing around with HTTP headers and user agents for a while, but that didn't do any good. However, this answer did exactly what I wanted and the website accepted my HTTP request. Thank you very much, I'm glad it still works in 2022. – Some Guy Dec 16 '22 at 06:56
27

"This is probably because of mod_security or some similar server security feature which blocks known

spider/bot

user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo

from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

web_byte is a bytes object returned by the server, and the content is mostly UTF-8 encoded, so you need to decode web_byte using the decode method.

This solved the problem I was having while trying to scrape a website using PyCharm.

P.S.: I use Python 3.4.
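
If you'd rather not hard-code UTF-8, a minimal sketch that asks the response for its declared charset and falls back to UTF-8, reusing the url from above:

from urllib.request import Request, urlopen

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
resp = urlopen(req)
charset = resp.headers.get_content_charset() or 'utf-8'  # charset from Content-Type, if present
webpage = resp.read().decode(charset)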

royatirek
10

Based on the previous answers, this worked for me with Python 3.7 after increasing the timeout to 10.

from urllib.request import Request, urlopen

req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
royatirek
Jonny_P
5

Adding a cookie to the request headers worked for me:

from urllib.request import Request, urlopen

def get_page_content(url, head):
    """
    Function to get the page content
    """
    req = Request(url, headers=head)
    return urlopen(req)

url = 'https://example.com'
head = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'referer': 'https://example.com',
    'cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
Deepesh Nair
  • You saved me. I met a URL that needed some other things added to the header, such as 'origin' = 'url1' and 'referrer' = 'url1', to make the request without a 403 happening. – Raii May 21 '22 at 03:24
3

If you feel guilty about faking the user agent as Mozilla (see the comment on the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:

    import urllib.request as urlrequest  # assuming urlrequest is an alias for urllib.request

    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
    urlrequest.urlopen(req, timeout=10).read()

My application is to test the validity of specific links that I refer to in my articles by scraping them; it is not a generic scraper.

Sudeep Prasad
2

Since the page works in a browser but not when called from a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.

Demonstration:

curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'

...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>

and the headers dumped to r.txt contain the status line:

HTTP/1.1 403 Forbidden

Try sending a 'User-Agent' header that fakes a web client.

NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like Firebug's Net tab), to see which URL you need to call to get the table's content.
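
For illustration, once the browser's network tools reveal the Ajax request, you can usually call that endpoint directly. The URL below is a made-up placeholder, not the real CME Group endpoint:

import json
from urllib.request import Request, urlopen

# Hypothetical JSON endpoint spotted in the browser's network tab
ajax_url = 'http://www.example.com/products/table-data?page=1'
req = Request(ajax_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urlopen(req).read().decode('utf-8'))
print(data)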

Robert Lujo
2

You can use urllib's build_opener like this:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Connection', 'keep-alive'),
    ('Upgrade-Insecure-Requests', '1'),
]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(url, "test.xlsx")
grantr
1

You can try two things; the details are in this link.

1) Via pip:

pip install --upgrade certifi

2) If that doesn't work, try running the Certificates.command that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):

open /Applications/Python\ 3.*/Install\ Certificates.command
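
If the real obstacle turns out to be certificate verification rather than the 403 itself, a minimal sketch of passing a certifi-backed SSL context to urlopen (assumes the certifi package is installed; the URL is a placeholder):

import ssl
import certifi
from urllib.request import Request, urlopen

ctx = ssl.create_default_context(cafile=certifi.where())  # trust store from certifi
req = Request('https://example.com', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()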

Johnson
1

I ran into this same problem and was not able to solve it using the answers above. I ended up getting around the issue by using requests.get() and then using the .text of the result instead of using read():

from requests import get

req = get(link)
result = req.text
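
If the plain requests.get() still comes back with a 403, the same User-Agent trick from the answers above also works with requests. A minimal sketch, reusing the link variable:

from requests import get

resp = get(link, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()   # raises an HTTPError if the server still answers 403
result = resp.text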
0

I pulled my hair out with this for a while and the answer ended up being pretty simple. I checked the response text and I was getting "URL signature expired" which is a message you wouldn't normally see unless you checked the response text.

This means some URLs just expire, usually for security purposes. Try getting the URL again and update it in your script. If there isn't a new URL for the content you're trying to scrape, then unfortunately you can't scrape it.
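
A minimal sketch of how to see that response text when urlopen fails, so messages like "URL signature expired" show up (the URL below is a placeholder):

from urllib.error import HTTPError
from urllib.request import Request, urlopen

try:
    body = urlopen(Request('https://example.com/some-signed-url')).read()
except HTTPError as e:
    # HTTPError is file-like, so the error body can be read and inspected
    print(e.code)
    print(e.read().decode('utf-8', errors='replace'))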

Toakley
0

Open the developer tools and go to the Network tab. Choose one of the requests for the content you want to scrape; the expanded details will show the user agent, which you can then add to your own request headers.