python urllib2 returns garbage

Question

I am trying to download a web page with Python and access some elements on the page. When I download the page, the content is garbage. These are the first lines of the page:

‹í}évÛH²æïòSd±ÏmÉ·’¸–%ÕhµÕ%ÙjI¶«JããIÐ(‰îî{æ1æ÷¼Æ¼Í}’ù"à""’‚d÷t»N‰$–\"ãËˆŒˆŒÜøqïíîùï'û¬¼gôÁnžm–úq<ü¹R¹¾¾._›å ìUôv»]¹¡gJÌqÃÍ’‡%z‹[ÎÖ3†[(,jüËÈ½Ú,í~ÌýX;y‰Ùò×f)æ7q…JzÉì¾F<ÞÅ]Uª

This problem happens only on the following website: http://kickass.to. Is it possible that they have somehow protected their page? This is my Python code:

import urllib2
import chardet
url = 'http://kickass.to/'
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
page = response.read()
f = open('page.html','w')
f.write(page)
f.close()
print response.headers['content-type']
print chardet.detect(page)

And the result:

text/html; charset=UTF-8
{'confidence': 0.0, 'encoding': None}

It looks like an encoding issue, but chardet detects 'None'. Any ideas?

Can you access the URL in your browser? I don't know about you but my ISP blocks that site. Maybe it is something to do with that? – anon582847382, Feb 22 '14 at 23:21
Wow, I have little experience with Python 2, but maybe try `urllib` rather than `urllib2` for the sake of trying? – anon582847382, Feb 22 '14 at 23:34
You might notice that `wget` fetches the same 'garbage' (actually sane, but gzipped) content. – nullptr, Feb 22 '14 at 23:37

score 5 · Accepted Answer · edited May 23 '17 at 12:15


This page is returned in gzip encoding.

(Try printing out response.headers['content-encoding'] to verify this.)
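A minimal sketch of that check, assuming you already have the value of the 'content-encoding' header and the raw body in hand. The names `looks_gzipped` and `make_gzip` are made up for this illustration, and the sample payload is built locally rather than fetched from the site. Besides the header, it looks at the first two bytes, since every gzip stream starts with 0x1f 0x8b.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_gzipped(content_encoding, body):
    """Return True if the header value or the leading bytes indicate gzip."""
    declared = (content_encoding or "").lower()
    return "gzip" in declared or body[:2] == GZIP_MAGIC

def make_gzip(data):
    """Build a gzip payload in memory (stand-in for a real response body)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(data)
    return buf.getvalue()

sample = make_gzip(b"<html>hello</html>")
print(looks_gzipped("gzip", sample))               # header says gzip
print(looks_gzipped(None, sample))                 # no header, magic bytes match
print(looks_gzipped(None, b"<html>hello</html>"))  # plain HTML
```

The byte check is useful when a server sends compressed data without labelling it, which chardet cannot make sense of either.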

Most likely the website doesn't respect the 'Accept-Encoding' field in the request and assumes that the client supports gzip (most modern browsers do).

urllib2 doesn't decompress gzip-encoded responses automatically, but you can use the gzip module for that, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage?
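As a rough sketch of the gzip-module approach (not necessarily the exact code from the linked thread), the helper below decompresses a body according to its 'content-encoding' value. `decode_body` is a hypothetical name, and the demo payload is created locally so the snippet runs without network access. It uses `gzip.GzipFile` over an in-memory buffer, which should work on both Python 2.7 and Python 3.

```python
import gzip
import io
import zlib

def decode_body(raw, content_encoding):
    """Return the decompressed body for gzip/deflate, else the bytes unchanged."""
    enc = (content_encoding or "").strip().lower()
    if enc in ("gzip", "x-gzip"):
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if enc == "deflate":
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw

# Demo with a locally built gzip payload standing in for response.read()
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"<html>hello</html>")
compressed = buf.getvalue()

page = decode_body(compressed, "gzip")
print(page)
```

In the question's code this would presumably be called as `page = decode_body(response.read(), response.headers.get('content-encoding'))`, after which chardet sees the actual HTML instead of compressed bytes.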


answered Feb 22 '14 at 23:29

nullptr


I already printed this out and got {'confidence': 0.0, 'encoding': None} back. Is it gzip then? – user3341975 Feb 22 '14 at 23:34

You've printed the results of charset detection. I bet the charset detector doesn't expect gzip. And yes, if you print 'content-encoding' from headers, you will see 'gzip'. – nullptr Feb 22 '14 at 23:36
Thanks a lot, it worked. I thought I had properly tested the gzip encoding before publishing the issue. Cheers – user3341975 Feb 22 '14 at 23:46
