
When I crawl the webpage with urllib2, I don't get the page source but a garbled string that I can't make sense of. My code is as follows:

url = 'http://finance.sina.com.cn/china/20150905/065523161502.shtml'
conn = urllib2.urlopen(url)
content = conn.read()
print content

Can anyone help me find out what's wrong? Thank you so much.

Update: I think you can run the code above to reproduce what I see. This is what I get in Python:

{G?0????l???%ߐ?C0 ?K?z?%E |?B ??|?F?oeB?'??M6? y???~???;j????H????L?mv:??:]0Z?Wt6+Y+LV? VisV:캆P?Y?, O?m?p[8??m/???Y]????f.|x~Fa]S?op1M?H?imm5??g?????k?K#?|??? ???????p:O ??(? P?FThq1??N4??P???X??lD???F???6??z?0[?}??z??|??+?pR"s?Lq??&g#?v[((J~??w1@-?G?8???'?V+ks0?????%???5)
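The noise above is exactly what gzip-compressed bytes look like when printed as text. A minimal sketch of the effect (the sample HTML here is made up, not the real page):

```python
import gzip

# Stand-in for the page's HTML (an assumption for illustration only)
html = b'<html><head><meta charset="gbk"/></head><body>...</body></html>'

# Compress it the way a server with Content-Encoding: gzip would
compressed = gzip.compress(html)

# Printing the compressed bytes directly yields unreadable noise,
# much like the output shown above
print(compressed[:40])

# Decompressing restores the original markup
print(gzip.decompress(compressed))
```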

This is what I expected (obtained with curl):

<html>
<head>
<link rel="mask-icon" sizes="any" href="http://www.sina.com.cn/favicon.svg" color="red">
<meta charset="gbk"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
  • What do you mean by a `garbled` string? You should post what you get and what you expected. – Anand S Kumar Sep 05 '15 at 08:14
  • You're looking at the source (in its entirety) of that page. If you want any sort of cleaned-up, relevant information I suggest you also try to integrate `BeautifulSoup` into your script. – Matt Sep 05 '15 at 08:15

1 Answer


Here is a possible way to get the page source using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# URL to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)

# Use BeautifulSoup to parse the requested content
soup = BeautifulSoup(r.content, "lxml")
print soup
  • It also works, thank you. The cause of the encoding problem is that the server returns compressed content, but urllib2 doesn't decompress the data automatically, so you can also decompress the data manually. – finch Sep 15 '15 at 07:07
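The manual decompression mentioned in the comment above could be sketched like this; the helper name and the simulated response are illustrative assumptions, not code from the post (with urllib2 the header would come from something like `conn.info().get('Content-Encoding')`):

```python
import gzip
import io

def decompress_if_gzipped(raw_bytes, content_encoding):
    """Decompress a response body by hand when the server sent gzip.

    `content_encoding` is the value of the Content-Encoding response
    header; anything other than 'gzip' is passed through unchanged.
    """
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)).read()
    return raw_bytes

# Simulated gzip response body plus the matching header value
body = gzip.compress(b'<html>...</html>')
print(decompress_if_gzipped(body, 'gzip'))
```

This is the step requests performs for you automatically, which is why the answer above prints readable HTML while the raw urllib2 read does not.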