
I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is those files don't all have the same encoding. They could be UTF-8, UTF-8 with BOM, or UTF-16.

Is there any way to read those files without knowing their encoding in advance?

Sam Black
In the most general sense, no. But you can use various heuristics to have a good go at it; it's very dependent on your specific data set. – Tom Dalton Dec 23 '15 at 03:25

2 Answers


You can read those files in binary mode, and the chardet library can help you detect the character encoding. Using chardet, you can detect the encoding of each file and then decode the bytes you read. Note that this module has limitations: detection is a heuristic and can guess wrong.

As an example:

from chardet import detect

with open('your_file.txt', 'rb') as ef:
    result = detect(ef.read())

print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
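Putting that together with os.walk, here is a minimal sketch of the whole conversion (it assumes chardet is installed; `convert_to_utf8` and `convert_tree` are hypothetical helper names, not part of chardet):

```python
import os
from chardet import detect

def convert_to_utf8(path):
    # Hypothetical helper: detect a file's encoding with chardet,
    # then rewrite the file in place as UTF-8.
    with open(path, 'rb') as f:
        raw = f.read()
    enc = detect(raw)['encoding'] or 'utf-8'  # detect() may return None
    text = raw.decode(enc)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def convert_tree(folder):
    # Walk the folder and convert every file found.
    for root, _dirs, files in os.walk(folder):
        for name in files:
            convert_to_utf8(os.path.join(root, name))
```

Since detect() only guesses, you may want to check the reported `confidence` value before overwriting a file.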
blong

If it is indeed always one of these three then it is easy. If you can read the file using UTF-8 then it is probably UTF-8; otherwise it will be UTF-16. With the 'utf-8-sig' codec, Python will also automatically discard the BOM if one is present.
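To illustrate the BOM handling with only the standard library:

```python
# 'utf-8-sig' accepts UTF-8 with or without a BOM and strips it;
# the plain 'utf-8' codec keeps the BOM as a leading '\ufeff' character.
data = b'\xef\xbb\xbfhello'           # UTF-8 bytes preceded by a BOM
print(data.decode('utf-8-sig'))       # hello
print(repr(data.decode('utf-8')))     # '\ufeffhello'
print(b'hello'.decode('utf-8-sig'))   # hello (no BOM is fine too)
```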

You can use a try ... except block to try both:

try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')
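For completeness, a self-contained version of that idea (`convert_file` is a hypothetical name; note this relies on the UTF-16 files carrying a BOM, since BOM-less UTF-16 of ASCII text can also decode as valid UTF-8):

```python
def convert_file(src, dst):
    # Hypothetical helper: read src as UTF-8 (BOM-tolerant) or UTF-16,
    # then write the text back out as plain UTF-8.
    for encoding in ('utf-8-sig', 'utf-16'):
        try:
            with open(src, encoding=encoding) as f:
                text = f.read()
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError('%s is neither UTF-8 nor UTF-16' % src)
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)
```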

If other encodings are present as well (like ISO-8859-1) then forget it, there is no 100% reliable way of figuring out the encoding. But you can guess; see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?

roeland
  • @ClaytonWahlstrom yes, that is also what the linked question says. But for this simple case it is not necessary. – roeland Dec 23 '15 at 03:31