Is there a better way to handle file encoding in Python?

Question

I have some text files with different, unknown encodings. Right now I have to open each file as binary to detect its encoding first, and then open it again with that encoding.

  import chardet
  from os.path import splitext

  # First pass: read the raw bytes so chardet can guess the encoding
  with open(f, 'rb') as bf:
    code = chardet.detect(bf.read())['encoding']
  print(f'{f} : {code}')
  # Second pass: reopen the same file as text with the detected encoding
  with open(f, 'r', encoding=code) as source:
    texts = extractText(source.readlines())
  with open(splitext(f)[0] + '_texts.txt', 'w', encoding='utf-8') as dist:
    dist.write('\n\n'.join('\n'.join(x) for x in texts))

So is there a better way to handle this problem?

Look at this link. Might be useful to what you are looking for. https://stackoverflow.com/questions/18263136/how-to-deal-with-unknown-encoding-when-scraping-webpages — Bhadresh Dhanani, Sep 13 '17 at 16:47
@EricDuminil These are files from different software. There is no way to guess the encodings. — sfy, Sep 13 '17 at 16:52

score 2 · Accepted Answer · answered Sep 13 '17 at 16:47

Instead of reopening and rereading the file, you could just decode the text you already read:

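import chardet

# Read the file once, as raw bytes
with open(filename, 'rb') as fileobj:
    binary = fileobj.read()
# 'encoding' can be None if chardet cannot make a guess
probable_encoding = chardet.detect(binary)['encoding']
text = binary.decode(probable_encoding)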
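
Since a detector's guess can be missing or wrong, one possible way to harden the decode step is to try the guess first and then fall back to a short list of encodings. The following is only a sketch under assumptions: the helper name `decode_best_effort` and the fallback order (`utf-8`, then `latin-1`) are hypothetical choices for illustration, not something the question or chardet prescribes.

```python
def decode_best_effort(binary, guess=None, fallbacks=("utf-8", "latin-1")):
    """Decode bytes using a detector's guess, then a list of fallbacks.

    `guess` is whatever the detector reported (it may be None).
    Returns a (text, encoding_used) tuple.
    """
    candidates = ([guess] if guess else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return binary.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Reached only if no candidate succeeded; substitute undecodable bytes.
    return binary.decode("utf-8", errors="replace"), "utf-8"
```

With chardet, this could be called as `text, used = decode_best_effort(binary, chardet.detect(binary)['encoding'])`. Note that `latin-1` accepts any byte sequence, so keeping it last means the loop always returns something, though not necessarily the correct text.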

user2357112