
I receive files from an undocumented resource that can contain data that looks like:

16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4
#recordsWritten:1293462

The above is just an example; the files I'm working with contain all kinds of different languages (and thus encodings). I'm opening the file with Python 3.6 (an inherited code base that I've migrated from Python 2 to Python 3) using the following code:

import os

f = open(file_path, "r")

f.seek(0, os.SEEK_END)              # jump to the end of the file
f.seek(f.tell() - 40, os.SEEK_SET)  # then back up 40 from the end
records_str = f.read()
print(records_str)

Using this code, I receive: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte`

If I change it to include an explicit encoding, `f = open(file_path, "r", encoding='utf-8')`, I receive the same error.

Changing the encoding to utf-16 results in it printing:

랂菣Ꚃ菣Ɩȴ⌊敲潣摲坳楲瑴湥ㄺ㤲㐳㈶ਂ

Which appears to be wrong.

Switching it to open the file in binary mode, `f = open(file_path, "rb")`, results in it outputting:

b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n#recordsWritten:1293462\x02\n'

Now this is slightly better; however, when I eventually come to process the file, I don't want to be adding `\x82\xb7\xe3\x83\xa5` to my database, I'd rather add ガガスゴイヘンシ. So, is there a way to handle Unicode-encoded files like this? I've also looked at Mozilla's chardet project to try to determine the encoding, but following its code examples, it reports the file as utf-8 encoded.
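
For reference, the chardet check was along these lines (a minimal sketch using chardet's detect API, which takes the raw bytes and returns its best guess plus a confidence):

import chardet

# Feed chardet the raw bytes and print its best guess; on this file
# it reports utf-8 with high confidence, even though parts of the
# data clearly don't decode as utf-8.
with open(file_path, "rb") as f:
    raw = f.read()
print(chardet.detect(raw))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}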

Jarede
  • Try using [codecs](https://bip.weizmann.ac.il/course/python/PyMOTW/PyMOTW/docs/codecs/index.html#working-with-files), instead of `open`. In any case `import codecs`, `codecs.open(file_path, "rb")`. – Thymen Jan 05 '21 at 11:03
  • Data without encoding has no meaning. It's garbage. In your case you can't do more than reading in binary mode and doing an educated guess on chunks, which of cause you would have identify first. – Klaus D. Jan 05 '21 at 11:05
  • @Thymen not sure that `codecs` offers anything above and beyond the normal `open` function – Jarede Jan 05 '21 at 11:29
  • Perhaps see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Jan 05 '21 at 12:05
  • @tripleee I suspect that seeking to the middle of a UTF-8 sequence is what is tripping it up. I don't think i actually need to do that... I just need the last line of the file (or well the `#recordsWritten:1293462` bit) – Jarede Jan 05 '21 at 12:13

2 Answers


Without knowledge of the actual bytes in the file, all we can do is speculate.

If the file is not using a single encoding throughout, there is really no way to process it programmatically. You will have to divide it into sections and separately convert each one using whatever encoding is correct for that sequence. This will almost certainly require manual work, if only to establish the boundaries between sections with different encodings.

Going forward, you will probably want to convert everything to a single encoding; my recommendation for that would be UTF-8. It should be able to accommodate anything you can get Python to recognize as a valid string in the first place.

As a crude example, if you know the example you provided uses plain 7-bit ASCII for the Latin sections and EUC-JP for the Japanese characters, maybe try

with open(filename, 'rb') as filebytes:
    raw_bytes = filebytes.read()

# Decode each section separately with the encoding we believe it uses
string = (raw_bytes[0:26].decode('ascii')
          + raw_bytes[26:54].decode('euc-jp')
          + raw_bytes[54:].decode('ascii'))

I determined the character ranges experimentally from the string you provided; if I guessed wrong which encoding you used for the Japanese text (in particular) they are probably not correct for your actual data.

Observe how we can read bytes from a filehandle opened with `rb` and Python will not try to apply any character decoding while reading them. But then, of course, we have to decode each section separately with the correct encoding if we want to turn this into a string.
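
If you then want to standardize on UTF-8 going forward, as recommended above, it is just a matter of writing the decoded string back out with that encoding (a minimal sketch; the output filename is only a placeholder):

# Write the reassembled string back out as UTF-8 so that
# downstream consumers only ever see a single encoding.
with open('normalized.txt', 'w', encoding='utf-8') as out:
    out.write(string)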

tripleee
  • EUC-JP can actually accommodate plain ASCII just fine, so this is a bit artificial. I just wanted to use an example which is thoroughly incompatible with anything Unicode. – tripleee Jan 05 '21 at 12:29

If you seek into the middle of a UTF-8 sequence, the error message doesn't necessarily mean the data isn't actually UTF-8, just that you can't seek to that exact position and get a useful decoding. "Invalid start byte" means this cannot be the beginning of a valid UTF-8 string.
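
If you do want to seek into the middle and resynchronize, one option (a sketch, and it only helps if that part of the file really is valid UTF-8) is to skip any continuation bytes, which always match the bit pattern 10xxxxxx, until you reach a byte that can legitimately start a character:

import os

with open(file_path, 'rb') as f:
    f.seek(-40, os.SEEK_END)  # binary mode allows seeking relative to the end
    chunk = f.read()

# UTF-8 continuation bytes look like 0b10xxxxxx; skip any we landed
# in the middle of so decoding starts on a character boundary.
start = 0
while start < len(chunk) and (chunk[start] & 0xC0) == 0x80:
    start += 1
print(chunk[start:].decode('utf-8'))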

If you only need to retrieve the last line of the file, maybe just read the entire file and pluck off the last line, or use try/except until you find a position you can safely seek to. Or simply read part or all of the file as bytes and then decode only the last line.

import os

with open(file_path, "rb") as f:  # notice "b" in "rb"
    f.seek(0, os.SEEK_END)
    f.seek(f.tell() - 40, os.SEEK_SET)
    records_bytes = f.read()
records_str = records_bytes.split(b'\n')[-2].decode('ascii')
print(records_str)

We use `[-2]` on the assumption that the file contains a final newline at the end (i.e. it is a well-formed text file), so `[-1]` is simply an empty string and `[-2]` retrieves the last actual line.

(Posting this as a separate answer so as not to pollute my other answer, which I hope might also be more useful to future visitors.)

tripleee
  • So I've taken what you said and managed to work my way to this answer on SO and used a bit of this too: https://stackoverflow.com/a/60416207/127606 meaning I can open the file in `r` mode rather than `rb` mode. – Jarede Jan 05 '21 at 13:02