Python error with decode utf-8 and Japanese characters

Question

Traceback (most recent call last):
  File "C:\Program Files (x86)\Python\Projects\test.py", line 70, in <module>
    html = urlopen("https://www.google.co.jp/").read().decode('utf-8')
  File "C:\Program Files (x86)\Python\lib\http\client.py", line 506, in read
    return self._readall_chunked()
  File "C:\Program Files (x86)\Python\lib\http\client.py", line 592, in _readall_chunked
    value.append(self._safe_read(chunk_left))
  File "C:\Program Files (x86)\Python\lib\http\client.py", line 664, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(5034 bytes read, 3158 more expected)

I am trying to fetch a page from a website, but whenever the response contains Japanese characters or other characters that cannot be decoded, I get this error. All I am using is urlopen and .read().decode('utf-8'). Is there a way to ignore or replace those characters so that no error is raised?

score 0 · Answer 1 · edited May 23 '17 at 12:11

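In the code you posted, the problem is not character encoding. The traceback ends in http.client.IncompleteRead, raised inside read(), so decode('utf-8') is never reached: the connection delivered only part of the HTTP response (5034 bytes read, 3158 more expected).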
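
If you want to keep whatever arrived before the transfer broke off, one option is to catch the exception and use its partial attribute. This is only a minimal sketch: read_body and TruncatedResponse are hypothetical names made up for illustration, and the fake response merely simulates a truncated transfer so the example runs without network access.

```python
import http.client


def read_body(response):
    """Read the whole body, falling back to the bytes received so far
    if the transfer ends before the full body has arrived."""
    try:
        return response.read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes that arrived before the read failed.
        return exc.partial


class TruncatedResponse:
    """Hypothetical stand-in for a response whose transfer is cut short."""

    def read(self):
        raise http.client.IncompleteRead(b"<html>partial", 3158)


body = read_body(TruncatedResponse())
print(body)
```

A truncated body may be missing markup or end in the middle of a multi-byte character, so retrying the request is often the safer choice.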

I tried this in an interactive Python shell:

>>> import urllib2
>>> url = urllib2.urlopen("https://www.google.co.jp/")
>>> body = url.read()
>>> len(body)
11155

This worked.

>>> body.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 102: invalid start byte

Ok, there is indeed an encoding error.

>>> url.headers['Content-Type']
'text/html; charset=Shift_JIS'

This is because your HTTP response is not encoded in UTF-8, but in Shift-JIS.
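
As a rough sketch of doing it yourself in Python 3, you could take the charset from the Content-Type header instead of hard-coding 'utf-8'. The helper name decode_body and the sample bytes are assumptions for illustration; errors="replace" swaps any undecodable byte for U+FFFD rather than raising, which is what the question asked for.

```python
from email.message import Message


def decode_body(body, content_type, fallback="utf-8"):
    """Decode bytes using the charset named in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset.
    charset = msg.get_content_charset() or fallback
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return body.decode(charset, errors="replace")


# Sample bytes encoded the way the server in this answer reported them.
sample = "プライバシーと利用規約".encode("shift_jis")
text = decode_body(sample, "text/html; charset=Shift_JIS")
```

With urllib.request, the response headers object offers the same method, so response.headers.get_content_charset() can supply the charset directly.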

You should probably not use urllib2 directly but a higher-level library that handles the response's character encoding for you. Or, if you want to do it yourself, see https://stackoverflow.com/a/20714761.

score 0 · Answer 2 · answered Jun 28 '14 at 11:16

Use requests and BeautifulSoup:

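import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.co.jp/")

# Pass the raw bytes so BeautifulSoup can detect the page's encoding itself.
soup = BeautifulSoup(r.content, "html.parser")

print(soup.find_all("p"))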

[<p style="color:#767676;font-size:8pt">© 2013 - <a href="/intl/ja/policies/">プライバシーと利用規約</a></p>]
