How to find the encoding of a python3 bytes object

Question

I know that bytes.decode gives a string and str.encode gives bytes, but only if the correct encoding is used.

Suppose I have a bytes object encoded using gb18030, and I try to decode it using big5:

>>> name = '深入 damon'
>>> b1 = name.encode('gb18030')
>>> b1.decode('big5')
UnicodeDecodeError: 'big5' codec can't decode byte 0xc8 in position 2: illegal multibyte sequence

Is there some way the encoding can be found from a bytes object?
I couldn't find any useful API for this in the Python 3 docs.

There isn't. If there was a general way of finding it, the decode function would not need its argument. The best you can get is educated guesses. The encoding is not part of the information contained in the bytes (unless you know you are dealing with self-describing data, like HTML); it is a property external to it. – R. Martinho Fernandes, May 31 '13 at 10:47
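
To illustrate why only educated guesses are possible, here is a minimal sketch that tries a few candidate encodings on the bytes from the question. The helper name candidate_decodings and the list of candidates are assumptions made for illustration, not part of any library.

```python
def candidate_decodings(data, encodings):
    """Return {encoding: decoded text} for every candidate that decodes without error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, so it is ruled out.
            pass
    return results


data = '深入 damon'.encode('gb18030')

# Hypothetical candidate list; in practice it comes from what you know about the data source.
results = candidate_decodings(data, ['utf-8', 'big5', 'gb18030', 'latin-1'])

for enc, text in results.items():
    print(enc, repr(text))
```

More than one candidate survives: latin-1 maps every byte to some character, so it never raises an error even though the text it returns is wrong. A successful decode therefore rules nothing in; it only fails to rule an encoding out.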

score 7 · Accepted Answer · edited Nov 22 '13 at 11:56


You can use the chardet package. Read this tutorial.

If you are using Ubuntu:

sudo apt-get install python3-chardet

If you are using pip:

pip install chardet2
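
Once the package is installed, a guess can be obtained roughly as sketched below. This assumes the module imports as chardet and uses its detect function, which returns a dict containing 'encoding' and 'confidence' keys. The wrapper guess_encoding is a hypothetical helper, and the fallback branch only exists so the sketch still runs when chardet is missing.

```python
try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return (encoding, confidence); encoding is None when no guess is available."""
    if chardet is None:
        return None, 0.0
    guess = chardet.detect(data)
    return guess['encoding'], guess['confidence']


b1 = '深入 damon'.encode('gb18030')
encoding, confidence = guess_encoding(b1)
print(encoding, confidence)

# Only trust the guess if decoding with it actually works.
if encoding is not None:
    try:
        print(b1.decode(encoding))
    except (UnicodeDecodeError, LookupError):
        print('guess did not decode cleanly')
```

The result is a statistical guess, not a fact about the bytes. Short inputs like this one give the detector very little to work with, so the reported encoding may be a related one, a wrong one, or None.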

edited Nov 22 '13 at 11:56 · David

answered May 31 '13 at 03:48 · kev

Thanks @kev, I ran into the universaldetector import error, but then your link to the dip3 case study helped. Much obliged. – damon May 31 '13 at 04:06

score 4 · Answer 2 · answered May 31 '13 at 04:08

Since you've entered it from the console, the encoding will be sys.stdin.encoding

>>> name = '深入 damon'
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> b1 = name.encode(sys.stdin.encoding)
>>> b1
b'\xe6\xb7\xb1\xe5\x85\xa5 damon'
>>> b1.decode(sys.stdin.encoding)
'深入 damon'
>>> print(b1.decode(sys.stdin.encoding))
深入 damon
