A question about Unicode in Python 2.
As far as I know, I should always decode everything I read from outside (files, network). decode converts the external bytes into internal Python unicode strings, using the charset given as its argument. So decode("utf8") means that the external bytes are UTF-8-encoded text and that they will be decoded into Python unicode strings.
I should also always encode everything I write to the outside. I pass the target encoding as the argument to encode, and it converts the string to bytes in that encoding, which I then write out.
Are these statements right?
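To make the two rules concrete, here is a minimal sketch of the decode-on-input, encode-on-output pattern. The sample bytes are hypothetical stand-ins for data read from a file or socket.

```python
# Hypothetical sample: the UTF-8 bytes for "cafe" with an acute accent on the e.
# In a real program these bytes would come from a file or a socket.
raw = b"caf\xc3\xa9"

# Input boundary: bytes -> unicode.
text = raw.decode("utf8")

# Inside the program, work only with unicode objects.
shouted = text.upper()

# Output boundary: unicode -> bytes, ready to be written out.
out = shouted.encode("utf8")
```

Here text and shouted are unicode objects, while raw and out are byte strings.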
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in a different encoding (for example cp1252), and the error happens when I try to decode it as utf8. So the question is: how do I write a bulletproof application?
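For illustration, the failure can be reproduced with a few bytes. This is a minimal sketch; the sample bytes are hypothetical and stand in for a cp1252-encoded document.

```python
# Hypothetical sample: "cafe" with an acute e, encoded as cp1252 (one byte, 0xE9).
raw = b"caf\xe9"

# Decoding as UTF-8 fails, because 0xE9 starts a multi-byte sequence
# that is never completed.
try:
    raw.decode("utf8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode("cp1252")
```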
I found that chardet is a good library for guessing the encoding. Is using it the only way to write bulletproof applications?
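A rough sketch of what such a guessing step could look like follows. The helper name to_unicode and the cp1252 fallback are assumptions made for illustration only. The sketch tries strict UTF-8 first, then asks chardet if it is installed, and finally falls back to a lossy decode.

```python
try:
    import chardet  # third-party package; treated as optional in this sketch
except ImportError:
    chardet = None


def to_unicode(raw, fallback="cp1252"):
    """Best-effort decode of a byte string (hypothetical helper)."""
    # 1. Strict UTF-8 first: it rarely succeeds by accident on other encodings.
    try:
        return raw.decode("utf8")
    except UnicodeDecodeError:
        pass

    # 2. Ask chardet for a guess, if it is available.
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None
        if guess:
            try:
                return raw.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass

    # 3. Last resort (an assumption of this sketch): decode with the fallback
    #    charset and replace undecodable bytes instead of raising.
    return raw.decode(fallback, "replace")
```

Note that chardet.detect returns a guess together with a confidence value, so the result is a heuristic rather than a guarantee.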