
I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is those files don't all have the same encoding. They could be UTF-8, UTF-8 with BOM, or UTF-16.

Is there any way to read those files without knowing their encoding in advance?

Sam Black
In the most general sense, no. But you can use various heuristics to have a good go at it; it's very dependent on your specific data set. – Tom Dalton Dec 23 '15 at 03:25

2 Answers


You can read those files in binary mode, and the chardet library can help you detect the character encoding. Using chardet, you can detect the encoding of each file and then decode the bytes you read. Note that this module has limitations: detection is a heuristic and can guess wrong.

As an example:

from chardet import detect

with open('your_file.txt', 'rb') as ef:
    result = detect(ef.read())

print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
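Putting that together with os.walk, here is a minimal sketch of the whole conversion (it assumes chardet is installed; `convert_to_utf8` and `convert_tree` are hypothetical helper names, not part of chardet):

```python
import os
from chardet import detect

def convert_to_utf8(path):
    # Hypothetical helper: detect a file's encoding with chardet,
    # then rewrite the file in place as UTF-8.
    with open(path, 'rb') as f:
        raw = f.read()
    enc = detect(raw)['encoding'] or 'utf-8'  # detect() may return None
    text = raw.decode(enc)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def convert_tree(folder):
    # Walk the folder and convert every file found.
    for root, _dirs, files in os.walk(folder):
        for name in files:
            convert_to_utf8(os.path.join(root, name))
```

Since detect() only guesses, you may want to check the reported `confidence` value before overwriting a file.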
blong

If it is indeed always one of these three then it is easy. If you can read the file using UTF-8 then it is probably UTF-8; otherwise it will be UTF-16. With the 'utf-8-sig' codec, Python will also automatically discard the BOM if one is present.
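To illustrate the BOM handling with only the standard library:

```python
# 'utf-8-sig' accepts UTF-8 with or without a BOM and strips it;
# the plain 'utf-8' codec keeps the BOM as a leading '\ufeff' character.
data = b'\xef\xbb\xbfhello'           # UTF-8 bytes preceded by a BOM
print(data.decode('utf-8-sig'))       # hello
print(repr(data.decode('utf-8')))     # '\ufeffhello'
print(b'hello'.decode('utf-8-sig'))   # hello (no BOM is fine too)
```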

You can use a try ... except block to try both:

try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')
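For completeness, a self-contained version of that idea (`convert_file` is a hypothetical name; note this relies on the UTF-16 files carrying a BOM, since BOM-less UTF-16 of ASCII text can also decode as valid UTF-8):

```python
def convert_file(src, dst):
    # Hypothetical helper: read src as UTF-8 (BOM-tolerant) or UTF-16,
    # then write the text back out as plain UTF-8.
    for encoding in ('utf-8-sig', 'utf-16'):
        try:
            with open(src, encoding=encoding) as f:
                text = f.read()
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError('%s is neither UTF-8 nor UTF-16' % src)
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)
```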

If other encodings are present as well (like ISO-8859-1) then forget it, there is no 100% reliable way of figuring out the encoding. But you can guess; see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?

roeland
  • @ClaytonWahlstrom yes, that is also what the linked question says. But for this simple case it is not necessary. – roeland Dec 23 '15 at 03:31