Python 3: reading UCS-2 (BE) file

Question

I can't seem to decode UCS-2 BE files (legacy stuff) under Python 3.3 using the built-in open() function. The stack trace shows a UnicodeDecodeError and includes my readLine() method. In fact, I wasn't able to find a flag for specifying this encoding.

I'm on Windows 8; the terminal is set to codepage 65001 and uses the 'Lucida Console' font.

Code snippet won't be of too much help, I guess:

def display_resource():
    f = open(r'D:\workspace\resources\JP.res', encoding=<??tried_several??>)
    while True:
        line = f.readline()
        if len(line) == 0:
            break

Appreciating any insight into this issue.

Martijn Pieters · Accepted Answer · 2016-07-20T21:45:15.737

36

UCS-2 is effectively UTF-16, at least for every code point that was assigned while the encoding was still called UCS-2.

Open it with encoding='utf16'. If there is no BOM (the Byte order mark, 2 bytes at the start, for BE that'd be \xfe\xff), then use encoding='utf_16_be' to force a byte order.
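A minimal sketch of both cases follows. The file names and sample text are hypothetical stand-ins written to a temporary directory, not the asker's JP.res:

```python
import os
import tempfile

# Hypothetical sample content; the real resource file is not available here.
text = "ブラウザー\nsecond line\n"

tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.res")  # illustrative name
no_bom = os.path.join(tmp_dir, "no_bom.res")      # illustrative name

# Create two big-endian UTF-16 files: one starting with the BE BOM, one without.
with open(with_bom, "wb") as f:
    f.write(b"\xfe\xff" + text.encode("utf_16_be"))
with open(no_bom, "wb") as f:
    f.write(text.encode("utf_16_be"))

# With a BOM present, 'utf_16' detects the byte order and strips the BOM.
with open(with_bom, encoding="utf_16") as f:
    lines_with_bom = f.readlines()

# Without a BOM, name the byte order explicitly.
with open(no_bom, encoding="utf_16_be") as f:
    lines_no_bom = f.readlines()

# Both reads yield the same decoded lines.
same_result = lines_with_bom == lines_no_bom
```

Iterating over the file object, or calling readlines() as above, also replaces the manual readline() loop from the question.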

edited Jul 20 '16 at 21:45

answered Jan 23 '13 at 20:10


Hello Martijn, I also thought UTF16 should work (based on the same article you linked). It does work, but just as with utf_16_be, every Japanese character shows up on screen as the same glyph: for example, "ブラウザー" becomes a run of identical, unreadable characters (squares). I should have distinguished between reading the line and printing it. Is this also a limitation of the terminal? Going forward, if the reading works fine and I can work with the strings, can I then write them back to another UCS-2 file and get the "right" output in a UCS-2-enabled editor? – elder elder Jan 23 '13 at 20:28

It's a limitation of the terminal, I am afraid. Your font does not support those characters; you'll have to find a different font that does. Just because the terminal cannot display them doesn't mean that the data itself has been damaged, so yes, if you encode back to UTF-16 when you write to the file you can open it again with other tools. – Martijn Pieters Jan 23 '13 at 20:30
Just wanted to add that I found another limitation of Lucida Console, in case it helps someone in the future: when displaying Japanese, Chinese, Arabic, Russian, or Romanian characters, it will sometimes repeat the last characters of a line, sometimes only the newline, other times as many as 7 or 8 characters. This behavior seems random. When the same lines are written to a file with the proper encoding (UTF16 in my case), they show up just right. – elder elder Jan 24 '13 at 10:32

@elderelder: That'd be a Windows console or font problem indeed. – Martijn Pieters Jan 24 '13 at 10:37
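
The point made in the comments, that a garbled console display does not mean the data is damaged, can be checked with a round trip. This is only a sketch under assumptions: the file names and sample string are hypothetical, and utf_16_be without a BOM is assumed as the target format.

```python
import os
import tempfile

# Hypothetical sample content and file names, for illustration only.
original = "ブラウザー\n"
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "in.res")
dst = os.path.join(tmp_dir, "out.res")

with open(src, "wb") as f:
    f.write(original.encode("utf_16_be"))

# newline="" leaves line endings untranslated on read and on write.
with open(src, encoding="utf_16_be", newline="") as f:
    decoded = f.read()

# ascii() gives an escaped form that any console font can show,
# which helps tell display problems apart from decoding problems.
escaped = ascii(decoded)

# Write the text back out in the same encoding.
with open(dst, "w", encoding="utf_16_be", newline="") as f:
    f.write(decoded)

# Compare raw bytes: identical files mean nothing was lost in between.
with open(src, "rb") as a, open(dst, "rb") as b:
    round_trip_ok = a.read() == b.read()
```

Writing with plain 'utf_16' instead would prepend a BOM and use the platform's native byte order, which is why the sketch names utf_16_be explicitly.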
