
Using Python 2.6.6 on CentOS 6.4

import json
import urllib2

url = 'http://www.google.com.hk/complete/search?output=toolbar&hl=en&q=how%20to%20pronounce%20e'
# Build an opener that keeps cookies between requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
opener.addheaders = [('Accept-Charset', 'utf-8')]
response = opener.open(url)
# read() returns the raw response body as a bytestring
page = response.read()
print page

Result:

...<suggestion data="how to pronounce eyjafjallaj

at which Python dies with no error message.

I think it dies because the next character is ö:

<toplevel>
<CompleteSuggestion>
<suggestion data="how to pronounce edinburgh"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce elle"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce edith"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce et al"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce eunice"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce english names"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce edamame"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce erudite"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce eyjafjallajökull"/>
</CompleteSuggestion>
<CompleteSuggestion>
<suggestion data="how to pronounce either"/>
</CompleteSuggestion>
</toplevel>

http://www.google.com.hk/complete/search?output=toolbar&hl=en&q=how%20to%20pronounce%20e

This appears to be a Unicode issue. I have tried encode('utf-8') and decode('utf-8') in many ways, but it still dies. Any ideas?

PS: It seems I need to stay with urllib2 rather than urllib, because urllib ignores cookies, which causes other problems.

davidjhp
possible duplicate of [urllib2 read to Unicode](http://stackoverflow.com/questions/1020892/urllib2-read-to-unicode) – m.wasowski Mar 22 '14 at 02:27

1 Answer

response.read() returns a bytestring. Python shouldn't die while printing a bytestring, because no character conversion occurs: the bytes are written as is.

You could try to print Unicode instead:

text = page.decode(response.info().getparam('charset') or 'utf-8')
print text
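
If the terminal cannot represent every character, encoding with a replacement policy before printing may help narrow the problem down. The following is only a sketch under assumptions: the helper names decode_body and encode_for_stdout are hypothetical, and the sample bytes stand in for what response.read() might return if the server sends UTF-8 (the real charset should come from the response headers).

```python
import sys


def decode_body(raw_bytes, charset=None):
    """Decode a response body, falling back to UTF-8 if no charset is given."""
    return raw_bytes.decode(charset or 'utf-8')


def encode_for_stdout(text, fallback='utf-8'):
    """Encode text for the terminal, replacing characters it cannot represent."""
    # sys.stdout.encoding can be None (for example when output is piped)
    encoding = getattr(sys.stdout, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')


# Assumed sample: the UTF-8 bytes of the suggestion containing "o with diaeresis"
sample = b'<suggestion data="how to pronounce eyjafjallaj\xc3\xb6kull"/>'

text = decode_body(sample)         # Unicode text
safe = encode_for_stdout(text)     # bytes that the terminal encoding can hold
```

In the question's code, decode_body(page, response.info().getparam('charset')) would take the place of the sample, and on Python 2 the result of encode_for_stdout can be passed straight to print.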
jfs