Best way to detect (utf) encoding in unknown file

Question

Here is what I'm currently using to open the various files that a user may have:

# check the encoding quickly
with open(file, 'rb') as fp:
    start_data = fp.read(4)
    if start_data.startswith(b'\x00\x00\xfe\xff'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xff\xfe\x00\x00'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xfe\xff'):
        encoding = 'utf-16'
    elif start_data.startswith(b'\xff\xfe'):
        encoding = 'utf-16'
    else:
        encoding = 'utf-8'

# open the file with that encoding
with open(file, 'r', encoding=encoding) as fp:
    do_something()

Is there a better way than the above to open a file that is known to be in some UTF encoding, but not which one?

score 0 · Answer 1 · answered Dec 19 '18 at 20:32

If you know the file is in some UTF encoding, you could use chardet to do something like:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

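with open(file, 'rb') as fp:
    # only the first 1000 bytes are sampled, for a quick guess
    detector.feed(fp.read(1000))
detector.close()
# result['encoding'] is None when chardet cannot make a guess (e.g. an empty file)
raw = (detector.result['encoding'] or '').lower()
encoding = 'utf-32' if 'utf-32' in raw else 'utf-16' if 'utf-16' in raw else 'utf-8'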

Note: magic and some of the other libraries mentioned in the question "Determine the encoding of text in Python" did not work for me. Also note that a file that is actually utf-8 will often be reported as ascii, typically because the sampled bytes contain only ASCII characters; that is why the code above falls back to utf-8.
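If adding a dependency is not an option, the BOM check from the question can also be written with the constants in the standard library's codecs module. The following is only a minimal sketch: sniff_utf_encoding and its default parameter are hypothetical names made up for illustration, and it assumes the file starts with a BOM whenever it is UTF-16 or UTF-32. Unlike the question's code, it also recognises the UTF-8 BOM and returns utf-8-sig so that the BOM is stripped on read.

```python
import codecs

# Checked in order. The UTF-32 entries come first because the UTF-32 LE BOM
# (ff fe 00 00) begins with the UTF-16 LE BOM (ff fe).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
)


def sniff_utf_encoding(path, default='utf-8'):
    """Hypothetical helper: guess a UTF codec from the file's byte order mark.

    Returns `default` when no BOM is found, so a BOM-less UTF-16 or UTF-32
    file is NOT detected by this sketch.
    """
    with open(path, 'rb') as fp:
        head = fp.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

It would be used as open(file, 'r', encoding=sniff_utf_encoding(file)). The 'utf-16' and 'utf-32' codecs read the BOM themselves to pick the byte order, so there is no need to tell the little-endian and big-endian variants apart. One known ambiguity: a UTF-16 LE file whose first character after the BOM is U+0000 looks the same as a UTF-32 LE BOM.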
