Detect charset and convert to utf-8 in Python?

Question

Is there a universal way to detect a string's charset? I use IPTC tags whose encoding is unknown. I need to detect it and then convert the tags to UTF-8.

Can anybody help?

Looking at your comment to @Ignacio, I would invite you to paste a couple of examples of "None" string into your question, so that we can play around with them and understand what the issue is. It would be helpful if you could also paste their correct decoded version as done on the portal you mentioned. — mac, Jul 15 '11 at 13:41

score 39 · Accepted Answer · edited Jan 09 '13 at 05:46

You want to use chardet, an encoding detector.
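A minimal sketch of how chardet could be applied to the question's problem, not a definitive implementation. `chardet.detect` takes bytes and returns a dict whose "encoding" entry may be None when no guess is possible. The function name `to_utf8`, the injectable `detect` parameter, and the `fallback` default are assumptions made for illustration.

```python
def to_utf8(data, detect=None, fallback="latin-1"):
    """Return `data` (bytes) re-encoded as UTF-8 bytes.

    `detect` is any callable that returns a dict with an "encoding" key,
    like chardet.detect; it defaults to chardet.detect itself.
    `fallback` is a hypothetical default used when detection yields None.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    guess = detect(data)  # e.g. {"encoding": "ISO-8859-1", "confidence": 0.73}
    encoding = guess.get("encoding") or fallback
    # errors="replace" keeps the conversion from raising on a wrong guess
    return data.decode(encoding, errors="replace").encode("utf-8")
```

Calling `to_utf8(raw_tag)` on a bytes value uses chardet directly. Falling back to latin-1 never fails to decode, but it can produce wrong characters, so the result is only as good as the guess.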

edited Jan 09 '13 at 05:46

NullUserException

answered Jul 15 '11 at 13:25

Ignacio Vazquez-Abrams

3

It doesn't work; I tried it before asking here. Some strings get None as the encoding, but that can't be right: the tags must be encoded somehow, because one web portal recognizes them. – robos85 Jul 15 '11 at 13:32
+1: chardet seems to be one of the best current ways of doing encoding detection. @robos85: It is not possible to do a perfect encoding detection: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file/436299#436299. – Eric O. Lebigot Jul 15 '11 at 13:44
2

I have developed a much more successful way of detecting the encoding, based on knowing the language. It gets the 8-bit encodings right. Finally. – tchrist Feb 05 '12 at 11:02
27

@tchrist: care to share? – MestreLion Oct 22 '13 at 15:21

score 18 · Answer 2 · answered Feb 04 '12 at 00:12

It's a bit late, but there is also another solution: try to use pyicu.

An example:

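import icu  # PyICU, the Python bindings for ICU

def convert_encoding(data, new_coding='UTF-8'):
    # data must be bytes; ICU guesses the most likely charset
    coding = icu.CharsetDetector(data).detect().getName()
    if new_coding.upper() != coding.upper():
        # decode with the detected charset, then re-encode (Python 3)
        data = data.decode(coding).encode(new_coding)
    return data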

answered Feb 04 '12 at 00:12

parkouss

pyicu is based on ICU, which sometimes misdetects encodings: http://sourceforge.net/p/icu/mailman/icu-design/thread/OFC8C84672.06B930B4-ON85257BDC.005E1362-85257BDC.005DE5A0@lotus.com/ – coanor Oct 24 '14 at 03:06
7

@coanor: *any* encoding detector will fail in some cases, as there is no way to accurately determine the encoding for all tests – MestreLion Nov 28 '14 at 15:09
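Since no detector can be right every time, one common fallback when the plausible encodings are known in advance (for example from the language of the tags) is to try them in order. The sketch below uses only the standard library; the function name `decode_with_candidates` and the default candidate list are assumptions for illustration, not a recommendation for every data set.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first candidate encoding that accepts them.

    Returns a (text, encoding_name) tuple. Strict encodings such as UTF-8
    should come first, because permissive ones (latin-1 accepts any byte
    sequence) would otherwise mask them.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is correct, so the order of the candidates carries the real knowledge about the data.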

score 17 · Answer 3 · answered Oct 15 '14 at 12:32

If you want to do it with cchardet, you can use this function.

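import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # data must be bytes; detect() returns a dict with an 'encoding' key
    encoding = cchardet.detect(data)['encoding']

    # encoding is None when detection fails; leave the data untouched then
    if encoding and new_coding.upper() != encoding.upper():
        data = data.decode(encoding).encode(new_coding)

    return data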

answered Oct 15 '14 at 12:32

teawithfruit

I tried many encoding formats (base64, ...). The result is always ascii. – chourn solidet Jan 27 '21 at 15:33
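A likely explanation for the comment above, offered as a sketch rather than a diagnosis: base64 is a binary-to-text transfer encoding, not a character set, and its output consists only of ASCII characters. A charset detector therefore has nothing but ASCII to look at until the payload is base64-decoded. The variable names below are illustrative.

```python
import base64

original = "café".encode("utf-8")      # bytes in a real character encoding
wrapped = base64.b64encode(original)   # base64 output uses only ASCII characters

# A detector fed `wrapped` sees plain ASCII; the charset question only
# applies to the payload recovered by base64-decoding it first.
payload = base64.b64decode(wrapped)
text = payload.decode("utf-8")
```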

score 5 · Answer 4 · answered Dec 12 '13 at 17:06

There is another module called cchardet.

It is said to be faster than chardet.

Note that it requires Cython.

edited Dec 12 '13 at 17:14

answered Dec 12 '13 at 17:06

laike9m
