python urllib2 utf-8 encoding

Question

I have # -*- coding: utf-8 -*- at the top of my Python file.

The snippet:

import urllib2

opener = urllib2.build_opener()
# Set both headers in one list; a second assignment would replace the first.
opener.addheaders = [('User-agent', 'Mozilla/5.0'),
                     ('Accept-Charset', 'utf-8')]
f = opener.open(url)
doc = f.read().decode('utf-8')

The server response is: (via f.info())

Content-Type: text/html; charset=UTF-8

But I get the error:

UnicodeDecodeError: 'utf8' codec can't decode byte[...]: invalid continuation byte

What's wrong here?

score 3 · Accepted Answer · edited May 23 '17 at 10:27

Try decoding the data using 'latin-1' to see what it looks like. The error means the response contains bytes that are not valid UTF-8 (see "UnicodeDecodeError, invalid continuation byte").
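A minimal sketch of the Latin-1 preview suggested here. Latin-1 maps every byte value to a character, so the decode cannot fail. The helper name latin1_preview and the sample bytes are made up for illustration; they are not the asker's actual response.

```python
def latin1_preview(raw, limit=200):
    """Decode the first `limit` bytes as Latin-1, which cannot raise."""
    return raw[:limit].decode('latin-1')


# Hypothetical sample, not the real response: a UTF-8 "e acute" (two bytes,
# 0xC3 0xA9) followed by a lone Latin-1 "e acute" (one byte, 0xE9).
sample = b'caf\xc3\xa9 / caf\xe9'
preview = latin1_preview(sample)
print(repr(preview))
```

In the preview, a character that was sent as UTF-8 shows up as two characters (0xC3 0xA9 becomes "Ã©"), while one sent as a single Latin-1 byte shows up as itself. Which pattern appears is a hint about what the server really sent.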

It would be helpful if you posted the result of list(f.read())[:100] so we can see the data.
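If the first 100 bytes turn out to be plain ASCII (a doctype usually is), the offending byte sits further in. One possible way to find it is to use the offset carried by the exception. This is only a sketch: bytes_around_error is a hypothetical helper and the sample data is invented.

```python
def bytes_around_error(raw, width=20):
    """Return (offset, nearby bytes) for the first byte that is not valid
    UTF-8, or None if the whole input decodes cleanly."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded
        lo = max(exc.start - width, 0)
        return exc.start, raw[lo:exc.start + width]
    return None


# Hypothetical data: an ASCII header, filler, then one Latin-1 byte (0xE9).
sample = b'<!DOCTYPE html>' + b'x' * 200 + b'caf\xe9 '
offset, context = bytes_around_error(sample)
print(offset, repr(context))
```

With the real response, pass the result of f.read() instead of sample and post the repr of the returned slice.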

FYI, putting # -*- coding: utf-8 -*- is unrelated to your issue. That encoding refers to the encoding of your python script itself, not the data it is handling :-)

answered Nov 11 '11 at 23:22 by Raymond Hettinger

Thanks for your reply. list(f.read())[:100] is:`['<', '!', 'D', 'O', 'C', 'T', 'Y', 'P', 'E', ' ', 'h', 't', 'm', 'l', ' ', 'P', 'U', 'B', 'L', 'I', 'C', ' ', '"', '-', '/', '/', 'W', '3', 'C', '/', '/', 'D', 'T', 'D', ' ', 'X', 'H', 'T', 'M', 'L', ' ', '1', '.', '0', ' ', 'S', 't', 'r', 'i', 'c', 't', '/', '/', 'E', 'N', '"', ' ', '"', 'h', 't', 't', 'p', ':', '/', '/', 'w', 'w', 'w', '.', 'w', '3', '.', 'o', 'r', 'g', '/', 'T', 'R', '/', 'x', 'h', 't', 'm', 'l', '1', '/', 'D', 'T', 'D', '/', 'x', 'h', 't', 'm', 'l', '1', '-', 's', 't', 'r']` – Nov 11 '11 at 23:32

score 1 · Answer 2 · edited May 23 '17 at 11:44

That particular error is commonly caused by trying to decode using utf-8 when the string was actually encoded with latin1. See UnicodeDecodeError, invalid continuation byte for some more info.
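The mismatch described here is easy to reproduce with made-up text (the string below is an illustration, not data from the question): encode it as Latin-1, then try to decode the result as UTF-8.

```python
text = u'caf\xe9 au lait'            # contains one non-ASCII character
as_latin1 = text.encode('latin-1')   # that character becomes the single byte 0xE9
as_utf8 = text.encode('utf-8')       # here it becomes the two bytes 0xC3 0xA9

try:
    as_latin1.decode('utf-8')
    reason = None
except UnicodeDecodeError as exc:
    # 0xE9 announces a multi-byte UTF-8 sequence, but the next byte
    # (a space) is not a continuation byte.
    reason = exc.reason

print(reason)
```

Decoding as_utf8 with 'utf-8' succeeds, so the failure comes from the bytes, not from the decode call.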

I suspect that, despite the header, the server is not returning UTF-8-encoded content.

A solution that might be worth pursuing is to use chardet to 'guess' which encoding is used. Despite chardet's awesomeness, consider it a last resort.
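A rough sketch of the "last resort" ordering: trust the declared charset first, ask chardet only if that fails, and fall back to Latin-1 so that something readable always comes back. decode_body is a hypothetical helper; chardet is a third-party package and is treated as optional here. chardet.detect() returns a dict whose 'encoding' entry may be None.

```python
try:
    import chardet  # third-party; the sketch still runs without it
except ImportError:
    chardet = None


def decode_body(raw, declared='utf-8'):
    """Return (text, encoding_used) for a response body."""
    # 1. The charset the server declared.
    try:
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass
    # 2. A statistical guess, if chardet is installed.
    if chardet is not None:
        guess = chardet.detect(raw).get('encoding')
        if guess:
            try:
                return raw.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass
    # 3. Latin-1 never fails, though the text may be wrong.
    return raw.decode('latin-1'), 'latin-1'


text, used = decode_body(b'caf\xc3\xa9')
print(repr(text), used)
```

A guess can be wrong, as with any detector, so it is worth logging which encoding was used rather than trusting the result silently.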

answered Nov 11 '11 at 23:16 by Rob Cowie

Thanks @Rob Cowie, I tried to use chardet. It's 'guessing' wrong I guess ;). I also tried some random encodings, but I always get shitty text. Surprisingly (for me) I get different outputs for the same character. – Nov 12 '11 at 01:06
