
I've got about 1000 filenames read via os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?

Example:

for item in os.listdir(rootPath):
    #Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item
BartoszKP
Philipp

6 Answers


Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that's it!

In Python 3 you need to provide bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']
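
For the original filename case, a minimal detect-then-decode sketch (an illustration, not part of this answer; it assumes Python 3 and that you list the directory as bytes so the raw filename bytes are preserved):

import os
import chardet

root_path = b'.'  # hypothetical directory; a bytes path makes os.listdir return bytes
for raw_name in os.listdir(root_path):
    guess = chardet.detect(raw_name)['encoding'] or 'utf-8'  # fall back if detection fails
    print(raw_name.decode(guess))
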
george
  • Seems to me it doesn't work. I created a string variable and encoded it as utf-8; chardet returned TIS-620 encoding. – Taras Vaskiv Jun 08 '18 at 15:20
  • I found that cchardet appears to be the current name for this or a similar library...; chardet was not findable. – Martin Haeberli Feb 19 '19 at 04:20
  • A bit confused here. It seems like it isn't possible to provide an str class as an argument. Only b'your string' works for me, or directly providing a byte variable. – Yoav Vollansky Aug 06 '19 at 14:26
  • The problem with this answer for me is that some cp1252/latin1 characters can be interpreted as technically valid utf8 - which leads to `ê` type characters where it should have been `ê`. `chardet` seems to try utf8 first, which results in this. There may be a way to tell it which order to use, but [lucemia's answer](https://stackoverflow.com/a/15918519/623519) worked better for me. – artfulrobot Dec 21 '19 at 08:39
  • ↑ sorry, I think I got utf8 and cp1252 the wrong way round in my description in last comment! – artfulrobot Dec 21 '19 at 08:44
  • In Python 3: `TypeError: Expected object of type bytes or bytearray, got: ` – HelloGoodbye Sep 26 '20 at 00:26
  • @HelloGoodbye You need to provide a byte string or bytearray, not a string to decode. – Frederick Reynolds Mar 08 '21 at 19:01
  • `>>> chardet.detect("ö".encode())` and `{'encoding': 'TIS-620', 'confidence': 0.99, 'language': 'Thai'}` — I'd say that doesn't work. – kontur May 13 '22 at 08:38

If your files are all either in cp1252 or utf-8, then there is an easy way.

import logging
def force_decode(string, codecs=['utf8', 'cp1252']):
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass

    logging.warning("cannot decode url %s" % ([string]))

for item in os.listdir(rootPath):
    #Convert to Unicode
    if isinstance(item, str):
        item = force_decode(item)
    print item
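
This snippet is Python 2. In Python 3, os.listdir() on a str path already returns decoded str, so to apply the same idea you would list the directory as bytes first; a sketch under that assumption:

import logging
import os

def force_decode(data, codecs=('utf8', 'cp1252')):
    for codec in codecs:
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            pass
    logging.warning("cannot decode %r", data)

for item in os.listdir(b'.'):  # hypothetical root; a bytes path yields bytes filenames
    print(force_decode(item))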

Otherwise, there is a charset detection lib:

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet

lucemia

You can also use the json package to detect the encoding.

import json

json.detect_encoding(b"Hello")
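
Note that json.detect_encoding() is designed for JSON documents, so it only distinguishes the UTF-8/UTF-16/UTF-32 family (mostly via BOMs and null-byte patterns) and cannot identify encodings like cp1252. A small sketch of what it can do:

import json

data = "Grüße".encode("utf-16")        # encode() prepends a BOM
encoding = json.detect_encoding(data)  # 'utf-16'
print(data.decode(encoding))           # Grüße
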
Suyog Shimpi

charset_normalizer is a drop-in replacement for chardet.

It works better on natural language and has a permissive MIT licence: https://github.com/Ousret/charset_normalizer/

from charset_normalizer import detect
encoding = detect(byte_string)['encoding']
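
As with chardet, detect() may report None for the encoding when nothing matches, so a guarded end-to-end sketch (byte_string here is a hypothetical stand-in for your raw filename bytes):

from charset_normalizer import detect

byte_string = b'fa\xe7ade.txt'  # hypothetical cp1252/latin-1 bytes
result = detect(byte_string)
text = byte_string.decode(result['encoding'] or 'cp1252')  # fall back if detection fails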

PS: This is not strictly related to the original question, but this page comes up in Google a lot.

Dawars

The encoding chardet detects can usually be used to decode a bytearray without raising an exception, but the resulting string may still not be correct.

The try ... except ... way works perfectly for known encodings, but it does not work for all scenarios.

We can use try ... except ... first and then chardet as plan B:

import chardet
from typing import List

def decode(byte_array: bytearray, preferred_encodings: List[str] = None):
    if preferred_encodings is None:
        preferred_encodings = [
            'utf8',       # Works for most cases
            'cp1252'      # Other encodings may appear in your project
        ]

    # Try preferred encodings first
    for encoding in preferred_encodings:
        try:
            return byte_array.decode(encoding)
        except UnicodeDecodeError:
            pass
    else:
        # All preferred encodings failed; fall back to the detected encoding
        encoding = chardet.detect(byte_array)['encoding']
        return byte_array.decode(encoding)
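
A hypothetical call, with bytes that are invalid UTF-8 but valid cp1252, so the second preferred encoding is used:

print(decode(bytearray(b'caf\xe9')))  # -> 'café'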

Shawn Hu

I tried with both json and chardet, and I got these results:

import json
import chardet

data = b'\xa9 2023'
json.detect_encoding(data)  # 'utf-8'
data.decode('utf-8')  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte

chardet.detect(data)  # {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
data.decode("ISO-8859-1")  # '© 2023'
jakobdo