
I receive the following string from one website via mechanize:

'We\x92ve'

I know that \x92 stands for the ’ character. I'm trying to convert that string to Unicode:

>>> unicode('We\x92ve','utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2: unexpected code byte

What am I doing wrong?

Edit: The reason I tried 'utf-8' was this:

>>> response = browser.response()
>>> response.info()['content-type']
'text/html; charset=utf-8'

Now I see that I can't always trust the Content-Type header.

parxier
  • Well... you *should*, in general, trust the `Content-Type` header. When there is a `charset` declared in it, that's the definitive encoding of the page (even overriding a `<meta>` charset), and all browsers will use that encoding. So the page you are fetching is simply broken; a lone `\x92` byte in it would appear in browsers as a `�` mark. It's best not to second-guess a stated `charset` unless you've really no other choice; generally you should only fall back to `chardet`-style sniffing when no recognised `charset` is declared. (Again, that's what browsers do.) – bobince Feb 21 '10 at 13:55
  • @bobince I checked that page in a few modern browsers and they all show the ’ character. I'm confused. – parxier Feb 21 '10 at 14:01
  • @bobince Here it is: https://www.virginmobile.com.au/selfcare/MyAccount/login.jsp. Look at the 'Customer Updates' section from `10 Dec 2009`: `We’ve changed the name of our $29/$49 Free2V Vouchers`. Now when I look at the HTTP headers in Firefox/Chrome I see `Content-Type: text/html; charset=iso-8859-1`, but the HTML still has a `content="text/html; charset=utf-8"` meta tag. They either changed the `Content-Type` header overnight or mechanize returned incorrect headers. I'm confused even more now. :-) – parxier Feb 21 '10 at 23:54
  • For me, that page is returning the HTTP response header `Content-Type: text/html; charset=iso-8859-1`, which matches the encoding of the smart quote(\*) fine. The page contains an incorrect `<meta>` charset declaration of UTF-8, but the HTTP header takes precedence, forcing browsers to render correctly. I don't know why you might be getting the `Content-Type` header with `utf-8`; if I do a `urllib.urlopen` to that page I definitely get the `iso-8859-1` response. – bobince Feb 23 '10 at 00:06
  • (\*: well, of course it doesn't because the smart quote isn't included in ISO-8859-1; it's actually Windows code page 1252. But the way browsers get ISO-8859-1 and CP1252 confused is a different issue of no relevance here.) – bobince Feb 23 '10 at 00:08

1 Answer


\x92 stands for ’ alright, but it does so in the Windows-1252 encoding, not in UTF-8:

>>> print unicode('We\x92ve','1252')
We’ve
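
The same check can be written so that it runs on both Python 2 and Python 3. This is only a minimal sketch; the variable names (`raw`, `utf8_ok`, `text`) are made up for illustration and are not from the question:

```python
raw = b'We\x92ve'  # the bytes as received from the site

# On its own, 0x92 is a UTF-8 continuation byte, so it cannot start
# a character and the UTF-8 decode is expected to fail.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In Windows-1252, 0x92 maps to U+2019 RIGHT SINGLE QUOTATION MARK.
text = raw.decode('cp1252')
```

Here `'cp1252'` is the canonical Python codec name; `'1252'` as used above is an alias for it.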

If you don't know what encoding your source data is in, you can detect it using chardet (extremely easy to use).
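
If you would rather not add a dependency, a cruder alternative is to try the declared charset first and only fall back to Windows-1252 when that fails. Note that this is not what chardet does (chardet guesses from byte statistics); `decode_with_fallback` is a hypothetical helper written for this answer, and the choice of `cp1252` as the fallback is an assumption that suits Western European pages like this one:

```python
def decode_with_fallback(raw, declared=None, fallback='cp1252'):
    """Hypothetical helper: try the declared charset, then a fallback."""
    if declared:
        try:
            return raw.decode(declared)
        except (UnicodeDecodeError, LookupError):
            # The declared charset is wrong for these bytes,
            # or Python does not know a codec by that name.
            pass
    # cp1252 leaves a few bytes undefined, so use 'replace' to turn
    # those into U+FFFD instead of raising another exception.
    return raw.decode(fallback, 'replace')


text = decode_with_fallback(b'We\x92ve', declared='utf-8')
```

When the declared charset is correct, the first decode succeeds and the fallback is never consulted, so well-behaved pages are unaffected.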

Max Shawabkeh
  • Thanks, Max. I trusted a content-type header that was incorrect. I'll definitely look at chardet. – parxier Feb 21 '10 at 13:42