
According to this answer: urllib2 read to Unicode

According to that answer, I have to get the charset from the Content-Type header in order to decode to Unicode. However, some websites don't send a "charset".

For example, the ['content-type'] for this page is just "text/html", and I can't convert the page to Unicode:

encoding=urlResponse.headers['content-type'].split('charset=')[-1]
htmlSource = unicode(htmlSource, encoding)
TypeError: 'int' object is not callable

Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?

Peter Mortensen
TIMEX
  • I have updated my comment, if you like to use one decode function all the time. – YOU Nov 27 '09 at 13:32
  • AAARRRGGHHH check out the URL, it does have a charset; read the error message, the code is shadowing the unicode() function FFS – John Machin Nov 27 '09 at 13:50
  • heh! and none of us spotted it! – bobince Nov 27 '09 at 13:57
  • @bobince: Yeah, SO needs an "I was wrong" button so that you can surrender your ill-gotten points but leave your answer there --- suitably labelled of course :-) – John Machin Nov 27 '09 at 14:14

4 Answers


Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?

No, there isn't. You must guess.

Trivial approach: try to decode it as UTF-8. If that works, great; it's probably UTF-8. If it doesn't, choose the most likely encoding for the kinds of pages you're browsing. For English pages that's cp1252, the Windows Western European encoding. (It's like ISO-8859-1; in fact most browsers will use cp1252 instead of ISO-8859-1 even if you specify that charset, so it's worth duplicating that behaviour.)

If you need to guess other languages, it gets very hairy. There are existing modules to help you guess in these situations; see, e.g., chardet.
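That two-step fallback can be sketched in a few lines. This is Python 3 spelling (the question uses Python 2, but the logic is the same), and `guess_decode` is just an illustrative name:

```python
def guess_decode(raw_bytes):
    """Decode bytes fetched over HTTP when no charset was declared.

    Try UTF-8 first: its multi-byte sequences are strict enough that
    text in other encodings rarely decodes as valid UTF-8 by accident.
    Fall back to cp1252, the Windows Western European codepage that
    browsers use in place of ISO-8859-1.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("cp1252", errors="replace")

print(guess_decode(b"caf\xc3\xa9"))  # valid UTF-8 -> café
print(guess_decode(b"caf\xe9"))      # not UTF-8, cp1252 fallback -> café
```

Note the `errors="replace"` on the fallback: cp1252 leaves a handful of byte values undefined, so this keeps the sketch from raising on arbitrary input.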

bobince
  • Can I just do: htmlSource = htmlSource.decode('utf8')...for everything? – TIMEX Nov 27 '09 at 13:17
  • http has a default encoding, see the RFC http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1 – wds Nov 27 '09 at 13:39
  • AAARRRGGHHH check out the URL, it does have a charset; read the error message, the code is shadowing the unicode() function FFS – John Machin Nov 27 '09 at 13:49
  • @wds: technically yes, but nothing in the world obeys that rule. :-( – bobince Nov 27 '09 at 13:56
  • The default charset is ISO-8859-1. See RFC 2616, sections 3.7.1 and 3.4.1 – Denis Otkidach Nov 27 '09 at 14:41
  • @wds and Denis: The specifications clash here, because XML has a standard encoding of UTF-8, if nothing is provided. This is a known issue (known to the spec authors, that is). – Boldewyn Nov 27 '09 at 14:48
  • "De facto" standard for HTTP is UTF-8. – Martin Andersson Oct 04 '14 at 07:41
  • RFC 2616 specified ISO-8859-1 only for headers, not body content. And whilst UTF-8 is increasingly popular (and the only sensible) encoding, it is unlikely a page whose encoding you have to guess (ie one without a `charset` or faux-BOM) will be UTF-8. Pages without encodings are typically written in the author's locale-specific Windows default codepage `cp1252` et al as above. – bobince Oct 04 '14 at 09:02
  • RFC 7231 obsoletes the old ISO-8859-1 default. http://tools.ietf.org/html/rfc7231#appendix-B – Hawkeye Parker Nov 10 '14 at 06:13
  • @bobince "RFC 2616 specified ISO-8859-1 only for headers, not body content." That obsolete RFC says otherwise: "When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP." http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1 – Hawkeye Parker Nov 10 '14 at 06:15

Well, I just browsed the given URL, which redirects to

http://www.engadget.com/2009/11/23/apple-hits-back-at-verizon-in-new-iphone-ads-video

then hit Ctrl + U (view source) in Firefox and it shows

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

@Konrad: what do you mean "seems as though ... uses ISO-8859-1"??

@alex: what makes you think it doesn't have a "charset"??

Look at the code you have (which we guess is the line that causes the error; please always show the full traceback and error message!):

htmlSource = unicode(htmlSource, encoding)

and the error message:

TypeError: 'int' object is not callable

That means that unicode doesn't refer to the built-in function; it refers to an int. I recall that in your other question you had something like

if unicode == 1:

I suggest that you use some other name for that variable -- say use_unicode.

More suggestions: (1) always show enough code to reproduce the error (2) always read the error message.
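The failure mode is easy to reproduce. Here is a minimal demonstration in Python 3 spelling, shadowing `str` since the `unicode` built-in no longer exists (`shadowing_error_message` is just an illustrative name):

```python
def shadowing_error_message():
    # Rebinding a built-in name to an int -- like the asker's
    # `if unicode == 1:` flag variable -- makes the later "call"
    # fail with exactly the TypeError from the question.
    str = 1                    # shadows the built-in str type locally
    try:
        str(b"abc", "utf-8")   # calls the int 1, not the type
    except TypeError as exc:
        return exc.args[0]

print(shadowing_error_message())  # 'int' object is not callable
```

Renaming the flag (say, `use_unicode`) makes the built-in visible again and the error disappears.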

Peter Mortensen
John Machin

htmlSource = htmlSource.decode("utf8") should work for most cases, unless you are crawling sites with non-English encodings.

Or you could write a force-decode function like this:

def forcedecode(text):
    # Try each encoding in turn; the more permissive ones go last
    # so they don't shadow the stricter ones.
    for enc in ["utf8", "sjis", "cp1252", "utf16"]:
        try:
            return text.decode(enc)
        except UnicodeDecodeError:
            pass
    return "Unknown Encoding"
Peter Mortensen
YOU
  • `cp1252` and `utf16` will successfully decode *any* byte sequence, so one of those would have to go at the end. (I suggest `cp1252`; UTF-16 is not widely used on the web as there are browser problems as well as it being generally inefficient.) – bobince Nov 27 '09 at 13:59
  • thx, I moved it to the end. And yes, utf16 is not widely used, but in my language utf8 takes 3 bytes while utf16 takes only 2 bytes; we use it for normal text or CSV files, and Microsoft Excel does not look good with utf8-encoded CSV files. – YOU Nov 27 '09 at 14:06
  • @bobince: sorry to say, your first sententence is not correct. You may be thinking of ISO-8859-1 aka latin1 which does have that property (as does any other single-byte kit that defines all codepoints). cp1252 doesn't define 5 byte values e.g. `'\x81'`. UTF-16 will bork on lone low surrogate, and on high surrogate not followed by low surrogate. – John Machin Nov 27 '09 at 14:40
  • thx John, it makes sense; utf16 might work from 0-FFFF, moved it to the end. But one thing comes to my mind: what will be this encoding "\xff\xff\x81"? Just a crafted one? – YOU Nov 27 '09 at 14:44
  • @S.Mark: "work" does NOT mean "didn't raise an exception". If you are asking for which encoding that 3-byte sequence would be valid, the answer is obviously ISO-8859-1 (or any other encoding for which all points are defined); if you add the rider that the sequence must be practical and meaningful, that cuts out ISO-8859-1 because \x80 to \x9f both inclusive are not-useful control characters. Why do you ask? – John Machin Nov 27 '09 at 15:32
  • @John, sorry, my question was not clear, I meant utf16 need 2 bytes to decode it, but if total size is 3 bytes like that "\xff\xff\x81", the decoding will get exception or does not throw, sorry for confusing you with unrelated question. thx – YOU Nov 27 '09 at 15:44
  • @S.Mark: "the decoding will get exception or does not throw" -- of course, there are no other alternatives. If you mean to ask what will happen if an attempt is made to decode an odd-sized string with UTF16xE, well of course it throws an exception (which you could find out for yourself). I am not confused; you would need to work harder than that :-) – John Machin Nov 27 '09 at 15:57

If there's no explicit charset, it should be ISO-8859-1, as stated earlier in the comments. Unfortunately that's not always the case in practice, which is why browser developers spent some time on algorithms that try to guess the encoding based on the content of the page.
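That RFC 2616 fallback is simple to implement yourself. A sketch (`charset_from_content_type` is a hypothetical helper, not part of urllib2):

```python
def charset_from_content_type(content_type):
    """Pull the charset parameter out of a Content-Type header value.

    Falls back to ISO-8859-1, the default that RFC 2616 assigned to
    text/* media types when no charset is given -- a rule real pages
    routinely violate, hence the need for content sniffing.
    """
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"' ").lower()
    return "iso-8859-1"

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```

When the header yields nothing useful, that's the point at which a sniffing library takes over.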

Luckily for you, Mark Pilgrim did all the hard work of porting the Firefox implementation to Python, in the form of the chardet module. His introduction to how it works, written as one of the chapters of Dive Into Python 3, is also well worth reading.

Peter Mortensen
wds