I receive files from an undocumented source that can contain data that looks like this:
16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4
#recordsWritten:1293462
The above is just an example; the files I'm working with contain all kinds of different languages (and thus encodings). I'm then opening the file with Python 3.6 (an inherited code base that I've upgraded from Python 2 to Python 3) using the following code:
import os
f = open(file_path, "r")
f.seek(0, os.SEEK_END)              # jump to the end of the file
f.seek(f.tell() - 40, os.SEEK_SET)  # step back 40 from the end to reach the record-count trailer
records_str = f.read()
print(records_str)
Using this code, I receive a UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
If I change it to include an explicit encoding:
f = open(file_path, "r", encoding='utf-8')
I receive the same error.
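My suspicion (which may well be wrong) is that the f.tell() - 40 offset lands in the middle of a multi-byte UTF-8 character, so the read starts on a continuation byte. A minimal sketch that reproduces the same error from just those two leading bytes:
# 0x82 is a UTF-8 continuation byte; starting a decode on it fails the same way
b"\x82\xb7".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte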
Changing the encoding to utf-16 results in it printing:
랂菣Ꚃ菣Ɩȴ⌊敲潣摲坳楲瑴湥ㄺ㤲㐳㈶ਂ
which appears to be wrong.
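If I understand it correctly, that output is just the UTF-8/ASCII bytes being consumed two at a time as UTF-16 code units; a quick sketch of what I think is happening (not part of the real code):
# plain ASCII bytes, read in pairs as UTF-16-LE, turn into unrelated CJK-looking code units
print(b"recordsWritten".decode("utf-16-le"))   # prints 敲潣摲坳楲瑴湥, the same run seen above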
Switching it to open the file in binary mode: f = open(file_path, "rb")
results in it outputting:
b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n#recordsWritten:1293462\x02\n'
Now this is slightly better; however, when I eventually come to process the file, I don't want to be adding \x82\xb7\xe3\x83\xa5\ to my database, I'd rather be adding ガガスゴイヘンシ.
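The closest I've come is to do the seek in binary mode (which allows a negative offset from the end) and then decode the tail myself, dropping whatever partial character got cut off; a rough sketch of that workaround, where the errors='replace' choice is just my guess:
import os
with open(file_path, "rb") as f:
    f.seek(-40, os.SEEK_END)   # binary mode allows seeking back from the end
    tail = f.read()
# decode the tail; the cut-off leading character becomes U+FFFD instead of raising
records_str = tail.decode("utf-8", errors="replace")
print(records_str)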
So, is there a way to handle Unicode-encoded files like this properly? I've also looked at the Mozilla chardet project to try to determine the encoding, but following its code examples, it thinks the file is utf-8 encoded.
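For reference, this is roughly how I've been calling chardet (detecting on the raw bytes of the whole file; the exact call is my reconstruction of the examples I followed):
import chardet
with open(file_path, "rb") as f:
    raw = f.read()
print(chardet.detect(raw))   # reports something like {'encoding': 'utf-8', 'confidence': 0.99, ...}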