I frequently use the urllib2 library to fetch and parse web pages in Python. Normally, the URL has the form:
    page_url = 'http://www.website.com/webpage.html'
I use this to read the page:
    import urllib2

    def read_page_contents(url):
        try:
            request = urllib2.Request(url)
            handle = urllib2.urlopen(request)
            content = handle.read()
        except Exception:
            # added as suggested by contributors below:
            import traceback
            traceback.print_exc()
            content = None
        return content

    page = read_page_contents(page_url)
    if page is not None:
        # start dealing with page contents
        pass
This works without problems, but when I tried a URL without an .html extension, like the one below:
    page_url = 'https://energyplus.net/weather-region/north_and_central_america_wmo_region_4'
the function fails to read the page. It always returns None and prints this error message:
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    HTTPError: HTTP Error 403: Forbidden
I searched Stack Overflow but found nothing useful with my keywords.
Please help me solve this problem.
Thanks in advance.
----------
I found the answer thanks to the help of the two contributors below. Using the requests library instead of urllib2 works:
    import requests

    def read_page_contents(url):
        try:
            response = requests.get(url)
            content = response.content
        except Exception:
            # added as suggested by contributors below:
            import traceback
            traceback.print_exc()
            content = None
        return content
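
For anyone who would rather stay with the standard library: one possible explanation, which nothing in this thread confirms, is that the server rejects the default User-Agent that urllib2 sends and accepts the one that requests sends. Below is a minimal sketch under that assumption. It is written against Python 3's urllib.request (the successor of urllib2), and the header value and the helper name build_request are illustrative choices of mine, not something the site documents.

```python
import urllib.error
import urllib.request

# Hypothetical header value; any browser-like string could be tried here.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-fetcher)"}


def build_request(url, headers=None):
    """Return a Request that carries an explicit User-Agent header."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def read_page_contents(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as handle:
            return handle.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        print("HTTP error %d for %s" % (err.code, url))
    except urllib.error.URLError as err:
        # No usable answer: DNS failure, refused connection, missing file, ...
        print("Could not reach %s: %s" % (url, err.reason))
    return None
```

Catching HTTPError separately makes the status code visible, so a 403 can be told apart from a network failure. Also note that requests.get does not raise an exception for a 403 on its own; the requests version above would need a call to response.raise_for_status() to treat an error status as a failure.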