
I frequently use the urllib2 library to fetch and parse web pages in Python. Normally, the URL has the form:

page_url = 'http://www.website.com/webpage.html'

I use this to parse the page:

import urllib2

def read_page_contents(url):
    try:
        request = urllib2.Request(url)
        handle = urllib2.urlopen(request)
        content = handle.read()
    except:
        # added as suggested by the contributors below:
        import traceback
        traceback.print_exc()
        content = None
    return content

page = read_page_contents(page_url)
if page is not None:
    # start dealing with page contents
    pass

This works without problems, but when I tried a URL that has no .html extension, like the one below:

page_url = 'https://energyplus.net/weather-region/north_and_central_america_wmo_region_4'

this method failed to read the page; it always returns None, along with this error message:

raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden:

I searched Stack Overflow, but with my keywords I found nothing useful.

Please help me solve this problem.

Thanks in advance

----------

I found the answer, thanks to the help of the two contributors below:

import requests

def read_page_contents(url):
    try:
        request = requests.get(url)
        content = request.content
    except:
        # added as suggested by the contributors below:
        import traceback
        traceback.print_exc()
        content = None
    return content
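One caveat worth noting about this solution: unlike urllib2, `requests.get` does not raise an exception for a 403 response; it returns a `Response` object whose `status_code` is 403 and whose `content` is the error page. If you want HTTP errors to be explicit, calling `raise_for_status()` (a real `requests` method) does that; the surrounding structure below is just one way to arrange it, mirroring the function from the question:

```python
import requests

def read_page_contents(url):
    """Return the page body as bytes, or None on any HTTP/network error."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turns 4xx/5xx responses into HTTPError
        return response.content
    except requests.RequestException:
        import traceback
        traceback.print_exc()
        return None
```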
Mohammad ElNesr
  • See [this question](http://stackoverflow.com/questions/33972671/downloading-https-pages-with-urllib-error14077438ssl-routinesssl23-get-serve). The reason it's returning none is because you told it to by using a plain `except` and putting `return None`. Remove that and you will get a more informative error message. – BrenBarn Jan 11 '17 at 06:56

2 Answers


This has nothing to do with the missing .html extension in your URL. Your code itself is rather confusing: there is page_url in one place and continent_url in another, so the code as posted wouldn't run. I am assuming that's a copy-paste problem. The real error in your code is this:

except:
    content = None

Never ever do this. If you use a generic catch-all except, you absolutely must log the exception:

except:
    import traceback
    traceback.print_exc()
    content = None

You will then see the real problem with the page you are trying to retrieve (which turns out to be a permission issue).

e4c5
  • Thank you for your comment, The `continent_url` was a typo, as I copied it from the original code, but it should be `page_url` , and I fixed it. About the exception, I will add the two lines you suggested and see what is the problem. Thanks – Mohammad ElNesr Jan 11 '17 at 06:57
  • I tried the exception handler you kindly suggested, it returns the following: `HTTPError: HTTP Error 403: Forbidden` – Mohammad ElNesr Jan 11 '17 at 07:00
  • There you are. That's a page you are not allowed to retrieve. – e4c5 Jan 11 '17 at 07:08

Use requests and save yourself time for more meaningful things.

import requests

url = 'https://energyplus.net/weather-region/north_and_central_america_wmo_region_4'
r = requests.get(url)

Output:

r.status_code: 200
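If a site answers 403 even through requests, one common (but not universal) cause is that the server rejects non-browser User-Agent strings; requests identifies itself as `python-requests/<version>` by default. Sending a browser-like header is a frequently used workaround, sketched below; whether it actually helps depends entirely on the server's policy, and the `Mozilla/5.0` value is just an illustrative choice:

```python
import requests

def fetch_with_browser_ua(url):
    # Some servers reject the default 'python-requests/<version>' User-Agent;
    # a browser-like value sometimes avoids a 403 (server policy varies).
    headers = {'User-Agent': 'Mozilla/5.0'}
    return requests.get(url, headers=headers, timeout=10)
```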
宏杰李
  • Thank you for your suggestion; requests.get solved the problem. What do you mean by "do more meaningful things" — is urllib2 not meaningful? – Mohammad ElNesr Jan 11 '17 at 07:11
  • @Mohammad ElNesr I mean that urllib is a low-level library: you need to code everything yourself and handle every tiny situation. Just use the high-level requests, and maybe take a walk or do something you like with the time saved. – 宏杰李 Jan 11 '17 at 07:18