
I am trying to download a web page with Python and access some elements on it. When I download the page, the content is garbage. These are the first lines of the page:

‹í}évÛH²æïòSd±ÏmÉ·’¸–%ÕhµÕ%ÙjI¶«JããIÐ(‰îî{æ1æ÷¼Æ¼Í}’ù"à""’‚d÷t»N‰$–\"ãˈŒˆŒÜøqïíîùï'û¬¼­gôÁnžm–úq<ü¹R¹¾¾._›å ìUôv»]¹¡gJÌqÃÍ’‡%z‹[ÎÖ3†[(,jüËȽÚ,í~ÌýX;y‰Ùò×f)æ7q…JzÉì¾F<ÞÅ]­Uª

This problem happens only on the following website: http://kickass.to. Is it possible that they have somehow protected their page? This is my Python code:

import urllib2
import chardet
url = 'http://kickass.to/'
user_agent = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) '
              'AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3')
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
page = response.read()
f = open('page.html','wb')  # binary mode: the body is raw bytes
f.write(page)
f.close()
print response.headers['content-type']
print chardet.detect(page)

And the result:

text/html; charset=UTF-8
{'confidence': 0.0, 'encoding': None}

It looks like an encoding issue, but chardet detects 'None'. Any ideas?

1 Answer


This page is returned in gzip encoding.

(Try printing out response.headers['content-encoding'] to verify this.)
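As a rough sketch of that check, the helper below looks at both the header value and the first two bytes of the body. The name looks_gzipped is made up for illustration. Every gzip stream starts with the bytes 0x1F 0x8B, and the "‹" at the start of the dump in the question is consistent with byte 0x8B shown as Windows-1252 text.

```python
GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(raw, content_encoding=None):
    """Hypothetical helper: True if the header says gzip or the body
    starts with the gzip magic bytes."""
    header_says_gzip = (content_encoding or '').strip().lower() == 'gzip'
    return header_says_gzip or raw[:2] == GZIP_MAGIC


# Possible use with the code from the question (not run here):
# print looks_gzipped(page, response.headers.get('content-encoding'))
```

Checking the magic bytes as well as the header covers the case where the header is missing or was not inspected.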

Most likely the website doesn't respect the 'Accept-Encoding' field in the request and simply assumes that the client supports gzip (most modern browsers do).

urllib2 doesn't decompress the response for you, but you can use the gzip module for that, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage?
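A minimal sketch of the decompression step, assuming the whole body has already been read into memory as a byte string. The name gunzip_body is hypothetical. It wraps the bytes in a file-like object and lets gzip.GzipFile read them back.

```python
import gzip
import io


def gunzip_body(raw):
    """Hypothetical helper: decompress a gzip-compressed response body
    given as a byte string."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
        return gz.read()


# Possible use with the code from the question (not run here):
# page = response.read()
# if response.headers.get('content-encoding') == 'gzip':
#     page = gunzip_body(page)
```

After this step the bytes should be the actual HTML, so chardet or a parser can work on them.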

Community
nullptr
  • I already printed this out and got {'confidence': 0.0, 'encoding': None} back. Is it gzip then? – user3341975 Feb 22 '14 at 23:34
  • You've printed results of charset detection. I bet the charset detector doesn't expect gzip. And yes, if you print 'content-encoding' from headers, you will see 'gzip'. – nullptr Feb 22 '14 at 23:36
  • Thanks a lot, it worked. I thought I had properly tested the gzip encoding before publishing the issue. Cheers – user3341975 Feb 22 '14 at 23:46