38

Is there a universal method to detect a string's charset? I use IPTC tags whose encoding is unknown. I need to detect the encoding and then convert the tags to UTF-8.

Can anybody help?

jww
robos85
  • Looking at your comment to @Ignacio, I would invite you to paste a couple of examples of "None" string into your question, so that we can play around with them and understand what the issue is. It would be helpful if you could also paste their correct decoded version as done on the portal you mentioned. – mac Jul 15 '11 at 13:41

4 Answers

39

You want to use chardet, an encoding detector
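As a minimal sketch of how this might look in practice (the `to_utf8` name, the `detect` parameter and the `latin-1` fallback are assumptions for illustration, not part of chardet): `chardet.detect` takes bytes and returns a dict whose `"encoding"` entry can be `None` when it has no guess, so that case is worth handling explicitly.

```python
def to_utf8(raw, detect=None, fallback="latin-1"):
    """Return `raw` (bytes) re-encoded as UTF-8, guessing the source encoding."""
    # UTF-8 is strict enough that a clean decode is strong evidence by itself.
    try:
        return raw.decode("utf-8").encode("utf-8")
    except UnicodeDecodeError:
        pass

    if detect is None:
        # Third-party dependency, imported lazily: pip install chardet
        import chardet
        detect = chardet.detect

    # The detector returns a dict; "encoding" may be None when it has no guess.
    guess = detect(raw).get("encoding") or fallback
    try:
        text = raw.decode(guess)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown guess: use the fallback and replace undecodable bytes.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

The fallback encoding is a guess like any other; if you know which language or region the tags come from, a more specific fallback will probably serve better than `latin-1`.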

NullUserException
Ignacio Vazquez-Abrams
  • 3
    It doesn't work; I tried it before asking here. Some strings get None as the encoding, but that can't be right. The tags are encoded somehow, because one web portal recognizes them. – robos85 Jul 15 '11 at 13:32
  • +1: chardet seems to be one of the best current ways of doing encoding detection. @robos85: It is not possible to do a perfect encoding detection: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file/436299#436299. – Eric O. Lebigot Jul 15 '11 at 13:44
  • 2
    I have developed a much more successful way of detecting the encoding, based on knowing the language. It gets the 8-bit encodings right. Finally. – tchrist Feb 05 '12 at 11:02
  • 27
    @tchrist: care to share? – MestreLion Oct 22 '13 at 15:21
18

It's a bit late, but there is also another solution: try to use pyicu.

An example:

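import icu

def convert_encoding(data, new_coding='UTF-8'):
    # Ask ICU for its best guess at the encoding of the raw bytes
    coding = icu.CharsetDetector(data).detect().getName()
    # Re-encode only when the detected encoding differs from the target
    if new_coding.upper() != coding.upper():
        # bytes.decode works on both Python 2 and 3, unlike unicode()
        data = data.decode(coding).encode(new_coding)
    return data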
parkouss
  • pyicu is based on ICU and will sometimes misdetect an encoding: http://sourceforge.net/p/icu/mailman/icu-design/thread/OFC8C84672.06B930B4-ON85257BDC.005E1362-85257BDC.005DE5A0@lotus.com/ – coanor Oct 24 '14 at 03:06
  • 7
    @coanor: *any* encoding detector will fail in some cases, as there is no way to accurately determine the encoding for all tests – MestreLion Nov 28 '14 at 15:09
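To illustrate why no detector can always be right, here is a small standard-library sketch (the byte string and the list of candidate encodings are arbitrary choices for the example): the same bytes decode without error under several 8-bit encodings, each giving different text, so the bytes alone cannot say which reading was intended.

```python
# Three bytes with no accompanying metadata.
raw = b"\xe4\xf6\xfc"

# Arbitrary 8-bit encodings that all accept these bytes.
candidates = ["latin-1", "cp1251", "koi8_r", "cp437"]

# Every decode succeeds, so none of them can be ruled out mechanically.
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```

A detector has to rank such candidates statistically, which is why knowing the language of the text helps so much.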
17

If you want to do it with cchardet, you can use this function.

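import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # cchardet.detect returns a dict; 'encoding' may be None if detection fails
    encoding = cchardet.detect(data)['encoding']
    if encoding is None:
        raise ValueError('could not detect encoding')

    # Re-encode only when the detected encoding differs from the target
    if new_coding.upper() != encoding.upper():
        data = data.decode(encoding).encode(new_coding)

    return data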
teawithfruit
5

There is another module called cchardet

It is said to be faster than chardet.

Note that it requires Cython

laike9m