Is there a Python library function which attempts to guess the character-encoding of some bytes?

Question

I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.

So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.

I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".

score 27 · Answer 1 · edited Jun 15 '23 at 18:24

+1 for the chardet module.

It is not in the standard library, but you can easily install it with the following command:

$ pip install chardet

Example:

>>> import urllib.request
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

See Installing Pip if you don't have pip.
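
As a minimal sketch, the detection result could be fed straight into a decode call. The helper name `guess_decode` and its `fallback` parameter are hypothetical and not part of chardet. The sketch assumes that falling back to latin-1 is acceptable when chardet is missing or cannot name an encoding.

```python
try:
    import chardet  # third-party; pip install chardet
except ImportError:  # keep the sketch usable without it
    chardet = None


def guess_decode(data, fallback="latin-1"):
    """Hypothetical helper: decode bytes using chardet's best guess."""
    encoding = None
    if chardet is not None:
        # detect() returns a dict; 'encoding' may be None if it cannot tell.
        encoding = chardet.detect(data)["encoding"]
    try:
        # errors="replace" guarantees no UnicodeDecodeError on a wrong guess.
        return data.decode(encoding or fallback, errors="replace")
    except LookupError:
        # The guessed name is not a codec Python knows; use the fallback.
        return data.decode(fallback, errors="replace")
```

The guess can still be wrong, as the comments on this answer point out, so the confidence value is worth checking before trusting the result.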

edited Jun 15 '23 at 18:24

Nuno André

answered Nov 06 '08 at 16:13

jfs

Didn't it strike you that `ISO-8859-2` was a nonsense? – John Machin Aug 28 '10 at 02:05
@John Machin: Yes, it was. It is educational to show that you should not blindly trust it. Current results are different ('utf-8' and 'ascii' correspondingly). – jfs Aug 28 '10 at 06:41
https://pypi.org/project/chardet/ – milahu Sep 04 '22 at 15:18
why is it nonsense? https://en.wikipedia.org/wiki/ISO/IEC_8859-2 – Eric Apr 07 '23 at 11:08

score 16 · Accepted Answer · answered Nov 07 '08 at 21:03

As far as I can tell, the standard library doesn't have such a function, though it's not too difficult to write one as suggested above. I think what I was really looking for was a way to decode a string with a guarantee that it wouldn't throw an exception. The errors parameter to str.decode (bytes.decode in Python 3) does that.

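def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    # Try each candidate encoding in order; return the first clean decode.
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Last resort: drop any non-ASCII bytes instead of raising.
    return s.decode('ascii', 'ignore')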
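
To illustrate the errors parameter, here is a short sketch of how the standard handlers treat a byte that is not valid ASCII. The sample bytes are made up for illustration. The handler names are the standard Python 3 codec error handlers.

```python
raw = b"caf\xe9"  # made-up sample: 0xE9 is not valid ASCII

# 'strict' (the default) raises on the first undecodable byte.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("ascii", "ignore")    # drops the bad byte
replaced = raw.decode("ascii", "replace")  # substitutes U+FFFD
escaped = raw.decode("ascii", "backslashreplace")  # keeps it as \xe9 text
```

Of the three lenient handlers, 'ignore' silently loses data, while 'replace' and 'backslashreplace' at least leave a visible trace of where the bad byte was.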

You could skip the `'ascii'` case at the end and just use `latin1`, since `latin1` will decode all 256 byte values without error. — Mark Ransom, Apr 03 '17 at 15:50
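
Following that comment, a variant might try the stricter encodings first and use latin1 as the unconditional last step. This is only a sketch, and the name `decode_lenient` is hypothetical. latin1 never fails, but it may yield the wrong characters if the bytes were really in some other encoding.

```python
def decode_lenient(s, encodings=("ascii", "utf8")):
    """Hypothetical variant: strict attempts first, latin1 as the catch-all."""
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # latin1 maps every byte 0x00-0xFF to the code point of the same value,
    # so this call cannot raise UnicodeDecodeError.
    return s.decode("latin1")
```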

score 2 · Answer 3 · answered Nov 07 '08 at 02:31

The best way I've found to do this is to try decoding the prospective string with each of the most common encodings in turn, inside a try/except block.

Jeremy Cantrell

