I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but I can't even load the page source with Python. At the moment I'm using the following code:

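import urllib2

URL = "http://m.finnkino.fi/events/now_showing"
req = urllib2.urlopen(URL, None, 2.5)  # data=None, timeout of 2.5 seconds
page = req.read()
print page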

Here is the traceback for the timeout error:

Traceback (most recent call last):
  File "user/src/finnkinoParser.py", line 26, in <module>
    main()
  File "user/src/finnkinoParser.py", line 13, in main
    getNowPlayingMovies()
  File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
    req = urllib2.urlopen(baseURL,None,2.5)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error timed out>

If I open the URL in my browser, it works fine. Could someone tell me what makes this site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there other kinds of sites for which a basic urlopen(URL) doesn't work?

Thanks for the help.

  • Do you not think it might help for us to know what happens when you run that code? How does it differ from what you expect? What errors do you get? – Daniel Roseman May 19 '11 at 15:52
  • I get a timeout too. wget works though. So it's not an issue of user agent; I just tried with a custom URL opener and it doesn't work either. – Sebastian Blask May 19 '11 at 15:56
  • Congratulations, this is a really strange problem you've found. On Python 2.7, it hangs forever on socket.py line 447 in a call to `self._sock.recv`, which is built in to Python and doesn't have any associated Python source code. This goes much deeper than Python and urllib2. – Mu Mind May 19 '11 at 16:26
  • @Jathanism: no, wget and curl have nothing to do with JavaScript and load the page just fine. – Mu Mind May 19 '11 at 16:27
  • It doesn't hang forever - eventually I get "Connection reset by peer" in a `URLError`. Bizarre, though - I've tried borrowing the UA string from my browser too. Python 3.2 fares no better. Someone might want to look at opening a bug for Python. – Thomas K May 19 '11 at 17:14

1 Answer

The following snippet works fine.

import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()

It seems their server checks several properties of the request. After testing a few times, here is my conclusion:

  1. The HTTP protocol version must be HTTP/1.1.
  2. If the request headers include a Connection header, its value should be keep-alive.
  3. The request headers must include a User-Agent header, whatever its value.

In urllib2, however, the Connection header is set to close by default in HTTPHandler (L1127 in urllib2.py), which breaks the second condition. You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
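To see the difference in what the two clients send, here is a minimal sketch that records the request line and headers on a throwaway local server instead of the real site. It is written for Python 3, where `urllib2` became `urllib.request` and `httplib` became `http.client`. The names `RecordingHandler` and `record_requests` are made up for this illustration, and the local server only records requests; it does not reproduce whatever the finnkino server actually does.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: stores what each client sent."""

    def do_GET(self):
        # Record the protocol version plus the two headers of interest.
        self.server.seen.append((
            self.request_version,
            self.headers.get("Connection"),
            self.headers.get("User-Agent"),
        ))
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def record_requests():
    """Send one request per client and return what the server saw."""
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)  # port 0: any free port
    server.seen = []
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client 1: urllib (empty ProxyHandler so env proxies are ignored).
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        opener.open("http://127.0.0.1:%d/" % port, timeout=5).read()

        # Client 2: http.client with the headers chosen explicitly.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/", headers={
            "User-Agent": "Mozilla/5.0",
            "Connection": "keep-alive",
        })
        conn.getresponse().read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return server.seen


if __name__ == "__main__":
    for version, connection, user_agent in record_requests():
        print(version, connection, user_agent)
```

Both clients speak HTTP/1.1, but the urllib request arrives with `Connection: close` while the `http.client` request carries exactly the headers passed in, which matches the explanation above for why only the second style gets through.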

silverfox
  • Thanks a lot. The snippet works fine. Can you think of any reason why the server side has requirements like that? – Juho Salmio May 20 '11 at 08:53
  • Maybe they want to make sure every request is submitted by a real person, not by a program or something. – silverfox May 20 '11 at 09:56