My Python 2.x script tries to download a web page that contains Chinese text. The page is encoded in UTF-8. urllib.urlopen(url) returns the content as type str, so I decode it with UTF-8, but that raises UnicodeEncodeError. I googled a lot of posts like this and this, but they don't work for me. Am I misunderstanding something?

My code is:

import urllib
import httplib
def get_html_content(url):
    response = urllib.urlopen(url)
    html = response.read()
    print type(html)
    return html


if __name__ == '__main__':
    url = 'http://weekly.manong.io/issues/58'
    html = get_html_content(url)
    print html.decode('utf-8')

Error message:

<type 'str'>
Traceback (most recent call last):
  File "E:\src\infra.py", line 32, in <module>
    print html.decode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 44: ordinal not in range(128)
[Finished in 1.6s]
KyL

2 Answers


The print statement converts its arguments to str objects. Encoding the text manually prevents it from being implicitly encoded with ascii:

import sys

...

if __name__ == '__main__':
    url = 'http://weekly.manong.io/issues/58'
    html = get_html_content(url)
    print html.decode('utf-8').encode(sys.stdout.encoding, 'ignore')

If the output still does not print correctly, replace sys.stdout.encoding with the encoding of your terminal.
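To make the decode/encode distinction concrete, here is a minimal sketch. `encode_for_output` is a hypothetical helper (not part of urllib or the standard library), and the fallback to utf-8 when `sys.stdout.encoding` is None is an assumption you may want to change. It shows that decoding the UTF-8 bytes succeeds, and that only the later encoding step can fail or lose characters:

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_output(text, encoding=None, errors='replace'):
    """Encode unicode text to bytes explicitly instead of relying on print.

    Hypothetical helper for illustration only.
    """
    if encoding is None:
        # sys.stdout.encoding may be None (e.g. when output is piped on
        # Python 2); falling back to utf-8 here is an assumption.
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors)


raw = b'\xe2\x80\x9cquoted\xe2\x80\x9d'  # UTF-8 bytes, like response.read()
text = raw.decode('utf-8')               # decoding works; no error here

# The error in the question comes from encoding to ascii, not from decoding.
ascii_ignored = encode_for_output(text, 'ascii', 'ignore')    # drops U+201C/U+201D
ascii_replaced = encode_for_output(text, 'ascii', 'replace')  # '?' per character
utf8_bytes = encode_for_output(text, 'utf-8')                 # round-trips to raw
```

Note that 'ignore' silently discards characters the target encoding cannot represent, so 'replace' may be easier to debug.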

UPDATE

Alternatively, you can set the PYTHONIOENCODING environment variable instead of encoding in the source code:

PYTHONIOENCODING=utf-8:ignore python program.py
falsetru
  • `print unicode_string` is the preferable way. Do not hardcode the encoding of your environment inside your script. [Use `PYTHONIOENCODING` instead](http://stackoverflow.com/a/28011696/4279) – jfs Jan 18 '15 at 16:10
  • *"print implicitly call str"* is an oversimplification, e.g., it does not explain why the first command in my answer works (it prints a non-ascii character to the terminal). – jfs Jan 19 '15 at 03:15
  • @J.F.Sebastian, thank you for the comments. I edited the sentence accordingly. – falsetru Jan 19 '15 at 03:55
  • What *"default encoding"* do you mean? Is it `sys.getdefaultencoding()`, `locale.getpreferredencoding(False)`, `locale.getpreferredencoding(True)`, `sys.getfilesystemencoding()`? `sys.stdout.encoding` may be `None` (try the second command in my answer). – jfs Jan 20 '15 at 10:38

If the standard output is redirected to a pipe then Python 2 fails to use your locale encoding:

⟫ python -c'print u"\u201c"' # no redirection -- works
“
⟫ python -c'print u"\u201c"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

To fix it, you could set the PYTHONIOENCODING environment variable, e.g., in bash:

⟫ PYTHONIOENCODING=utf-8 python -c'print u"\u201c"' | cat
“
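The same check can be scripted. The sketch below is an illustration under assumptions: `run_piped` is a hypothetical helper name, and it launches whichever interpreter is running the script (`sys.executable`). It runs a child Python process with its stdout connected to a pipe and with PYTHONIOENCODING set, then captures the raw bytes the child wrote:

```python
import os
import subprocess
import sys


def run_piped(code, io_encoding=None):
    """Run `code` in a child interpreter whose stdout is a pipe.

    Hypothetical helper: returns the raw bytes the child printed.
    """
    env = dict(os.environ)
    # Start from a clean slate so an inherited value does not hide the effect.
    env.pop('PYTHONIOENCODING', None)
    if io_encoding is not None:
        env['PYTHONIOENCODING'] = io_encoding
    return subprocess.check_output([sys.executable, '-c', code], env=env)


# With the variable set, the child encodes U+201C as UTF-8 even though its
# stdout is a pipe rather than a terminal.
piped = run_piped('print(u"\\u201c")', io_encoding='utf-8')
```

On Python 2, calling `run_piped` without `io_encoding` should reproduce the UnicodeEncodeError shown above, surfacing as a `subprocess.CalledProcessError` in the parent because the child exits with a non-zero status.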

On Windows, you need to set the envvar using a different syntax (for example, `set PYTHONIOENCODING=utf-8` in cmd.exe before running the script).

If your Windows console doesn't support utf-8 (it matters only for the first command where there is no redirection) then you could try to print Unicode directly using Win32 API calls like win-unicode-console does. See windows console doesn't print or input Unicode.

jfs