
I am new to this so please help me. I am using urllib.request to open and read webpages. Can someone tell me how my code can handle redirects, timeouts, and badly formed URLs? I have sort of found a way for timeouts, but I am not sure if it is correct. All opinions are welcome! Here it is:

import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

try:
    text = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
except timeout:
    logging.error('socket timed out - URL %s', url)

Thanks!

anon

2 Answers


Take a look at the urllib error page.

So, for the following behaviours:

  • Redirects: urllib.request follows HTTP 302 responses automatically by default; a redirect it can't handle surfaces as an HTTPError with a code. If you want to see redirects instead of following them, you can install a custom HTTPRedirectHandler; see the sketch after this list.
  • Timeouts: you have that correct.
  • Badly formed URLs: that's a URLError.
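
If you'd rather have redirects raise instead of being followed, a minimal sketch (assuming urllib's default handler set; I haven't tested this against a live redirecting URL) is to override redirect_request() so it raises the HTTPError itself:

import urllib.request
import urllib.error

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Turn 301/302/303/307 responses into an HTTPError instead of following them.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)

opener = urllib.request.build_opener(NoRedirectHandler)
try:
    opener.open("http://www.google.com", timeout=10)
except urllib.error.HTTPError as error:
    if error.code in (301, 302, 303, 307, 308):
        print("redirected to:", error.headers.get("Location"))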

Here's the code I would use:

from socket import timeout
import urllib.request
import urllib.error

try:
    text = urllib.request.urlopen("http://www.google.com", timeout=0.1).read()
except urllib.error.HTTPError as error:
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)

I couldn't find a redirecting URL to test with, so I'm not exactly sure how to check whether the HTTPError is a redirect.
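
One way to at least detect that a redirect happened, without a custom handler (a sketch; httpbin.org/redirect/1 is assumed as a test URL): urlopen() follows redirects itself, so you can compare the final URL with the one you requested:

import urllib.request

url = "http://httpbin.org/redirect/1"
resp = urllib.request.urlopen(url, timeout=10)
if resp.geturl() != url:
    # geturl() reports the URL actually fetched after any redirects.
    print("redirected to:", resp.geturl())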

You might find the requests package is a bit easier to use (it's suggested on the urllib page).

Darkstarone
  • Thank you for the response. I had a question about your code. Why did you not have a decode() method at the end? Wouldn't read() return bytes? I tried running that code and it works, but when I add decode('utf-8') at the end it crashes with a UnicodeDecodeError. How do you decide when to use decode and when not to? – anon May 29 '17 at 22:51
  • If you print out `text` you'll see it's plaintext (it was for me). – Darkstarone May 29 '17 at 22:54
  • But if you try doing any string operations, like comparing, it says you cannot perform string operations on bytes – anon May 29 '17 at 22:55
  • Good point, wrap `text` in `str()`. I.e. `str(text)`, or just wrap the entire urllib line in str: `str(urllib.request.urlopen("http://www.google.com", timeout=0.1).read())`. – Darkstarone May 29 '17 at 22:56
  • I'd add meaningful messages based on the content of the error, but I don't use this package personally so I couldn't be more specific I'm afraid. – Darkstarone May 29 '17 at 22:58
  • Already recommended it in my answer ;) [`requests`](http://docs.python-requests.org/en/master/) is suggested at the top of the [urllib page](https://docs.python.org/3/library/urllib.request.html#module-urllib.request). – Darkstarone May 29 '17 at 23:01
  • I added another except clause for ValueError. I just tried entering a gibberish URL like "ejknrvkjnew" and it throws ValueError, so I thought I might as well add that – anon May 29 '17 at 23:11
  • How would you handle the mentioned errors using the requests package? I found that package to be easier and I am thinking of switching to it – anon May 30 '17 at 00:12
  • See the [exceptions](http://docs.python-requests.org/en/master/user/quickstart/#errors-and-exceptions) of the documentation. It defines the types of errors you'd see. Also see [this](https://stackoverflow.com/questions/16511337/correct-way-to-try-except-using-python-requests-module) question. – Darkstarone May 30 '17 at 00:14
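
Folding the ValueError suggestion from the comments into the earlier example would look roughly like this (a sketch; "ejknrvkjnew" is the gibberish URL from the comment thread):

from socket import timeout
import urllib.request
import urllib.error

try:
    text = urllib.request.urlopen("ejknrvkjnew", timeout=0.1).read()
except ValueError as error:
    # Raised before any network I/O when the URL has no recognizable scheme.
    print("badly formed URL:", error)
except urllib.error.HTTPError as error:
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)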

Using the requests package I was able to find a better solution. The only exceptions you need to handle are:

import requests

try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    # Maybe set up for a retry, or continue in a retry loop
    pass
except requests.exceptions.TooManyRedirects:
    # Tell the user their URL was bad and try a different one
    pass
except requests.exceptions.ConnectionError:
    # Connection could not be completed
    pass
except requests.exceptions.RequestException:
    # Catastrophic error. Bail.
    raise

And to get the text of that page, all you need to do is use `r.text`.
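
A short usage sketch putting this together (hedged; fetch_text is a hypothetical helper, and r.raise_for_status() turns 4xx/5xx responses into a requests HTTPError, which RequestException also covers):

import requests

def fetch_text(url):
    # Hypothetical helper: return the page body as text, or None on any failure.
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()  # 4xx/5xx become requests.exceptions.HTTPError
    except requests.exceptions.RequestException as e:
        print("request failed:", e)
        return None
    return r.text  # requests decodes the body to text for you

print(fetch_text("http://www.google.com"))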

anon