UnicodeDecodeError problem with mechanize

Question

I receive the following string from one website via mechanize:

'We\x92ve'

I know that \x92 stands for the ’ character. I'm trying to convert that string to Unicode:

>>> unicode('We\x92ve','utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2: unexpected code byte

What am I doing wrong?

Edit: The reason I tried 'utf-8' was this:

>>> response = browser.response()
>>> response.info()['content-type']
'text/html; charset=utf-8'

Now I see I can't always trust the Content-Type header.

Well... you *should*, in general, trust the `Content-Type` header. When there is a `charset` declared in it, that's the definitive encoding of the page (even overriding a `<meta>` charset), and all browsers will use that encoding. So the page you are fetching is simply broken; a lone `\x92` byte in it would appear in browsers as a `�` mark. It's best not to second-guess a stated `charset` unless you've really no other choice; generally you should only fall back to `chardet`-style sniffing when no recognised `charset` is declared. (Again, that's what browsers do.) – bobince, Feb 21 '10 at 13:55
@bobince I checked that page in a few modern browsers and they all show the ’ character. I'm confused. – parxier, Feb 21 '10 at 14:01
@bobince Here it is: https://www.virginmobile.com.au/selfcare/MyAccount/login.jsp. Look at the 'Customer Updates' section from `10 Dec 2009`: `We’ve changed the name of our $29/$49 Free2V Vouchers`. Now I look at the HTTP headers in Firefox/Chrome and I see `Content-Type: text/html; charset=iso-8859-1`, but there is still a `content="text/html; charset=utf-8"` meta tag in the HTML. Either they changed the `Content-Type` header overnight or mechanize returned incorrect headers. I'm even more confused now. :-) – parxier, Feb 21 '10 at 23:54
For me, that page is returning the HTTP response header `Content-Type: text/html; charset=iso-8859-1`, which matches the encoding of the smart quote(\*) fine. The page contains an incorrect `<meta>` charset declaration of UTF-8, but the HTTP header takes precedence, forcing browsers to render correctly. I don't know why you might be getting the `Content-Type` header with `utf-8`; if I do a `urllib.urlopen` to that page I definitely get the `iso-8859-1` response. – bobince, Feb 23 '10 at 00:06
(\*: well, of course it doesn't, because the smart quote isn't included in ISO-8859-1; it's actually Windows code page 1252. But the way browsers get ISO-8859-1 and CP1252 confused is a different issue of no relevance here.) – bobince, Feb 23 '10 at 00:08

score 4 · Accepted Answer · answered Feb 21 '10 at 13:30

\x92 stands for ’ alright, but it does so in the Windows-1252 encoding, not in UTF-8:

>>> print unicode('We\x92ve','1252')
We’ve

If you don't know what encoding your source data is in, you can detect it using chardet (extremely easy to use).
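
If you'd rather not guess at all, one option is to try the declared charset first and only fall back to Windows-1252 when decoding fails. The following is a rough sketch of that idea, not anything mechanize provides: `charset_from_content_type` and `decode_body` are hypothetical helper names, and treating `iso-8859-1` as `cp1252` is an assumption borrowed from how browsers behave. It uses bytes literals so it runs on Python 2.6+ and Python 3 alike.

```python
import re


def charset_from_content_type(content_type):
    """Return the lowercased charset from a Content-Type value, or None."""
    match = re.search(r'charset=([\w.-]+)', content_type or '', re.I)
    return match.group(1).lower() if match else None


def decode_body(raw, content_type, fallback='cp1252'):
    """Decode raw bytes with the declared charset, then with the fallback."""
    declared = charset_from_content_type(content_type)
    # Assumption: mimic browsers, which treat ISO-8859-1 as Windows-1252.
    if declared in ('iso-8859-1', 'latin-1', 'latin1'):
        declared = 'cp1252'
    for encoding in (declared, fallback):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Last resort: undecodable bytes become U+FFFD instead of raising.
    return raw.decode(fallback, 'replace')


# The header claims UTF-8, but the byte is really Windows-1252.
text = decode_body(b'We\x92ve', 'text/html; charset=utf-8')
```

With the string from the question, `text` comes out as `u'We\u2019ve'`. If the page really is UTF-8, the first attempt succeeds and the fallback is never used.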

Max Shawabkeh

Thanks, Max. I trusted a Content-Type header that was incorrect. I'll definitely look at chardet. – parxier Feb 21 '10 at 13:42
