Bulletproof work with encoding in Python

Question

This is a question about Unicode in Python 2.

As far as I know, I should always decode everything that I read from outside (files, network). decode converts outside bytes to internal Python strings using the charset specified in its parameters. So decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.

Also, I should always encode everything that I write to the outside. I specify the encoding as a parameter of the encode function, and it converts the string to that encoding before it is written.

These statements are right, ain't they?

But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252), and the error happens when I try to decode it as UTF-8. So the question is: how do I write a bulletproof application?

I found that chardet is a good library for guessing the encoding, and that this is the only way to write bulletproof applications. Right?

The document will often come with a statement of its encoding somewhere. [See if the charset is specified.](http://en.wikipedia.org/wiki/Internet_media_type) If you don't have the encoding, or if the stated encoding is wrong, you may run into inevitable errors. The best you can hope to do then is handle the errors with some policy and produce a result that maybe kind of looks like what you want. — user2357112, Dec 03 '13 at 10:26
I find the best overview of encoding issues in Python is the following page : http://sebsauvage.net/python/snyppets/#unicode Furthermore, there's one little misconception in your post : the «decode» function will always give you Unicode. I agree with the previous comment, i.e.: Rely on charset declaration first and foremost. There really isn't a "bullet-proof" way to deal with wrongly-encoded documents. — Fred Osterrath, Dec 03 '13 at 13:19
https://stackoverflow.com/questions/606191/convert-bytes-to-a-python-string/27527728 can help with this also. — anatoly techtonik, Jan 16 '17 at 16:56

score 1 · Answer 1 · answered Dec 08 '13 at 20:58

... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.

...

These statements are right, ain't they?

No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
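
A minimal sketch of that behaviour, using made-up sample bytes. The `u""` and `b""` literals are used only so the snippet runs unchanged on Python 2.7 and Python 3 (where the result is a `str` rather than a `unicode` object):

```python
# Illustrative sample: the same text encoded two different ways.
utf8_bytes = u"caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
cp1252_bytes = u"caf\u00e9".encode("cp1252")   # b'caf\xe9'

# Decoding bytes with the matching codec yields a unicode object.
text = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec can raise UnicodeDecodeError:
# 0xE9 starts a multi-byte UTF-8 sequence, but no bytes follow it.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```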

Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding, it does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
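
The two-pass idea might look roughly like the sketch below. This is a heavily simplified stand-in for the HTML Standard's algorithm, not an implementation of it: the regular expression, the 1024-byte scan window, and the function names `sniff_declared_encoding` and `decode_html` are all assumptions made for illustration. A real application would more likely lean on an HTML parser and the HTTP `Content-Type` header.

```python
import codecs
import re

# Simplified pattern covering <meta charset="..."> and
# <meta http-equiv="Content-Type" content="...; charset=...">.
_META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_declared_encoding(raw, scan_bytes=1024):
    """First pass: return the encoding declared near the top of raw, or None."""
    match = _META_CHARSET.search(raw[:scan_bytes])
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return None
    return name


def decode_html(raw, default="utf-8"):
    """Second pass: decode with the declared encoding, else the default.

    Undecodable bytes are replaced with U+FFFD instead of raising, because
    a declaration does not guarantee the bytes really match it.
    """
    encoding = sniff_declared_encoding(raw) or default
    return raw.decode(encoding, "replace")
```

For example, `decode_html(b'<meta charset="windows-1252">caf\xe9')` decodes the trailing byte as `é`, while a document with no declaration falls back to UTF-8 with replacement characters for anything invalid.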

There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
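
A hedged sketch of how such a guess could be used defensively. `chardet.detect` is the library's documented entry point and returns a dict with an `encoding` key (which may be `None`). The wrapper name `decode_with_guess` and its fallback policy are assumptions for illustration. The import is guarded so the snippet still runs where chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, it just skips the guess


def decode_with_guess(raw, default="utf-8"):
    """Decode raw bytes using chardet's guess when one is available.

    Returns (text, encoding_used). The guess is unverified, so if decoding
    with it fails we fall back to `default`, replacing undecodable bytes.
    """
    guess = chardet.detect(raw)["encoding"] if chardet else None
    if guess:
        try:
            return raw.decode(guess), guess
        except (UnicodeDecodeError, LookupError):
            pass  # the guess was wrong, or names a codec Python lacks
    return raw.decode(default, "replace"), default
```

The fallback never raises, but it can silently lose data, so whether "replace" is an acceptable policy depends on the application.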

This one gets my vote; but maybe emphasize that `chardet` is a heuristic guesser which doesn't attempt to verify that the result it reports is valid -- you will need to prepare for decoding errors, and perhaps fall back to a second-best guess, or just give up (for example, a misconfigured web server could be serving you a bunch of JPEG bytes, which simply don't *have* a language). — tripleee, Dec 16 '15 at 15:22

score 1 · Answer 2 · answered Dec 08 '13 at 21:14

Try wrapping your decode calls in try/except blocks.

Try decoding as utf-8:
Catch exception if not utf-8:
if exception raised, try next encoding:
etc, etc...

Make it a function that returns the decoded string when (and if) it finds an encoding that doesn't raise an exception, and returns None or an empty string when it exhausts its list of encodings and the last exception is raised.
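
The steps above could be sketched like this. The function name `decode_first_match` and the default candidate list are assumptions for illustration. Order matters: a codec that maps every byte value, such as latin-1, never fails, so anything listed after it would never be tried.

```python
def decode_first_match(raw, encodings=("utf-8", "cp1252")):
    """Return raw decoded with the first encoding that works, else None."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    return None  # every candidate raised
```

Note that a successful decode only means the bytes were valid in that encoding, not that it was the encoding the author actually used.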

Like the others said, the encoding should be recorded somewhere, so check that first.

Not efficient, and frankly due to my skill level, may be way off, but to my newbie mind, it may alleviate some of the problems when dealing with unknown or undocumented encoding.

score -1 · Answer 3 · answered Dec 16 '15 at 14:58

Decode from cp437 to unicode. That way your bytes convert to unicode and back without loss.
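
The answer does not say why cp437 in particular. One plausible reading is that Python's cp437 codec assigns a character to all 256 byte values, so decoding cannot fail and encoding again restores the original bytes. A small check of that property, written to run on Python 2.7 and 3:

```python
# Every byte value 0-255, built in a way that works on Python 2.7 and 3.
all_bytes = bytes(bytearray(range(256)))

as_text = all_bytes.decode("cp437")    # does not raise: every byte is mapped
round_trip = as_text.encode("cp437")   # recovers the original bytes
```

This preserves the bytes, but the resulting characters are only the intended text if the data really was cp437.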

anatoly techtonik

If you -1, say if you need explanation why `cp437`. If you just don't like `cp437` because `unicode` is trendy, it can't be helped. – anatoly techtonik Dec 17 '15 at 08:56
