
Hopefully this is quite a simple question, but it's driving me crazy. I'm using Python 2.7.3 on an out-of-the-box installation of Ubuntu 12.10 server. I kept narrowing the problem down until I got to this snippet:

import urllib2
x=urllib2.urlopen("http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html", timeout=5)

It simply hangs forever and never times out. I'm evidently doing something wrong. Could anybody please help? Thank you very much indeed!

Matteo

Matteo Monti

3 Answers


Looks like you are experiencing a proxy issue. Here's a great explanation of how to work around it: Trying to access the Internet using urllib2 in Python.
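If a proxy picked up from the environment is the culprit, a minimal sketch of a workaround (nothing here is specific to your setup) is to build an opener with an empty ProxyHandler, so urllib2 bypasses HTTP_PROXY and friends entirely:

import urllib2

# an empty ProxyHandler makes urllib2 ignore any proxy settings
# picked up from the environment (HTTP_PROXY and friends)
opener = urllib2.build_opener(urllib2.ProxyHandler({}))
response = opener.open(
    "http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html",
    timeout=5)
print response.getcode()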

I've executed your code on my Ubuntu machine with Python 2.7.3 and haven't seen any errors.

Also, consider using requests:

import requests

# timeout=5 limits how long each socket operation may block
response = requests.get("http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html", timeout=5)
print response.status_code
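Note that in requests the timeout applies to individual socket operations, not to the whole download, and a timeout surfaces as an exception you can catch. A small sketch:

import requests

url = "http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html"
try:
    response = requests.get(url, timeout=5)
    print response.status_code
except requests.exceptions.Timeout:
    # raised when the server does not respond within 5 seconds
    print "request timed out"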

alecxe
  • Well, I didn't configure any proxy on my server... I'm not sure what I should do. How can I detect the presence of a proxy which I should configure? – Matteo Monti May 27 '13 at 13:35
  • Please also note that I'm on a web server with a public IP address, which is correctly detected from the outside. – Matteo Monti May 27 '13 at 13:43
  • It also manages to load other web pages without any difficulty. Only some pages simply won't load and hang forever. – Matteo Monti May 27 '13 at 13:44
  • Thanks. Hm, first see if something is in the `HTTP_PROXY` env variable (http://stackoverflow.com/questions/7338837/autodetect-proxy-setting-linux); see the sketch after these comments. Have you tried the code using `requests`? – alecxe May 27 '13 at 13:53
  • Yes, I tried your code using requests and it hung exactly like the other one. The `HTTP_PROXY` env variable is null on my system...! Strange, isn't it? – Matteo Monti May 27 '13 at 17:10
  • Yes. Please try two more things: `urllib2.urlopen('http://google.com')` and `curl http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html`. What do you see? – alecxe May 27 '13 at 17:27
  • urlopen on google reads the page immediately and without any problem; curl hangs indefinitely without any output. – Matteo Monti May 27 '13 at 17:53
  • So it looks like the issue is not related to python/urllib. You simply can't access `casacinema.eu` from the server - a firewall block? – alecxe May 27 '13 at 18:18
  • I don't really need it to be loaded. It would be enough if, after some time, python timed out and threw an exception. Casacinema.eu was just an example of something hanging! I just want it to stop trying after a while and GO ON! Is it possible somehow? There must be a way!! – Matteo Monti May 27 '13 at 20:19
  • Yeah, got it. Please try setting the timeout via `socket`: `import socket; socket.setdefaulttimeout(5)`. – alecxe May 27 '13 at 21:17
  • Already done! It didn't time out anyway. It actually DOES time out on any other system (on my laptop, for example) but NOT on my server. – Matteo Monti May 27 '13 at 22:11
  • I'm going crazy about this. – Matteo Monti May 27 '13 at 22:11
  • @MatteoMonti did you figure out your issue? – Brent Mar 04 '15 at 18:59
  • Uhmm yes, I think I did, but it being two years ago I really don't remember how! Sorry about that. – Matteo Monti Mar 04 '15 at 21:03
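A minimal sketch of the proxy check suggested in the comments above, assuming nothing about the system: `urllib.getproxies()` reads the same environment variables that urllib2 consults, so it shows exactly which proxies (if any) would be picked up:

import os
import urllib

# proxies that urllib2 would pick up from the environment
print urllib.getproxies()

# the raw environment variables, for comparison
for var in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
    print var, "=", os.environ.get(var)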

The original poster stated they did not understand why it would hang, but they also wanted a way to keep urllib.request.urlopen from hanging. I cannot say how to keep it from hanging, but if it helps someone, this is why it can hang.

The Python-urllib/3.6 client is picky. It expects, for example, the server to return `HTTP/1.1 200 OK`, not `HTTP 200 OK`. It also expects the server to close the connection when it sends `Connection: close` in the headers.

The best way to diagnose this is to get the raw output of the server's response and compare it with a response from another server that you know works. Then, if you must, create a test server and manipulate its response to determine exactly which difference is the cause. That can at least point to a change on the server that would keep the client from hanging.
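A minimal sketch of how you might capture that raw output with a plain socket (the host and path here are just the ones from the question):

import socket

host = "casacinema.eu"
path = "/movie-film-Matrix+trilogy+123+streaming-6165.html"

# open a plain TCP connection and send a bare HTTP/1.1 request
sock = socket.create_connection((host, 80), timeout=5)
sock.sendall("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))

# read and print the raw response, status line and headers included
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
print "".join(chunks)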

kmcguire

You can try using `socket.setdefaulttimeout(5)`, as alecxe suggested.

More details in the urllib2 docs:

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urlopen now uses the default timeout we
# have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
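With the default timeout set, a hanging fetch surfaces as an exception you can catch. A minimal sketch (note that urllib2 wraps the underlying timeout in a URLError on some code paths, so it is worth catching both):

import socket
import urllib2

socket.setdefaulttimeout(5)

try:
    response = urllib2.urlopen(
        "http://casacinema.eu/movie-film-Matrix+trilogy+123+streaming-6165.html")
    print response.getcode()
except socket.timeout:
    # raised when a socket operation exceeds the default timeout
    print "timed out"
except urllib2.URLError as e:
    # urllib2 may wrap the underlying socket.timeout in a URLError
    print "failed:", e.reason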
Kenneth.Wong