Converting binary stored Unicode Chinese Characters back to Unicode using Python 3

Question

I'm working from an OpenOffice-produced .csv with mixed Roman and Chinese characters. This is an example of one row:

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'b'Open heart 'b'Happy '

The section below contains two Chinese characters stored as bytes, which I would like displayed as Chinese characters on the command line from a very basic Python 3 program (see bottom). How do I do this?

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'

When I open the .csv in OpenOffice I need to select "Chinese Simplified EUC-CN" as the character set, if that helps. I have searched extensively, but I do not understand Unicode and the pages I found do not make sense to me.

import csv
f = open('Chinese.csv', encoding="utf-8") 
file = csv.reader(f)

for line in file:
    for word in line:
        print(word.encode('utf-8'), end='')
    print("\n")

Thank you in advance for any suggestions.

yes, I tried .decode but didn't have any luck. It gave an error about not having a decode method on the 'word' then another saying 'codec can't decode byte 0xce in position 1'. So your output shows Chinese characters? — Inyoka, May 14 '14 at 05:06
"AttributeError: 'str' object has no attribute 'decode'". FYI I have added the Chinese character sets to the terminal (I'm using a Mac). — Inyoka, May 14 '14 at 05:14
well, yes, if you already have `str` instances then they're already decoded into unicode, no need to decode them further obviously. Silly question: what does `print(word)` do? — roippi, May 14 '14 at 05:18
print(word) gives "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>" — Inyoka, May 14 '14 at 05:29
That is an issue with your terminal not telling python that it can print `utf-8` characters. see http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python — roippi, May 14 '14 at 05:34
@roippi http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python refers to Python2.* — Inyoka, May 14 '14 at 06:16
@eryksun thank you, it wasn't the terminal, it was the source code file. The question is different but the solution is here : http://stackoverflow.com/a/542899/792015 I read tens of articles and nowhere does it say the source needs saving in UTF-8 but it makes sense. I don't have enough reputation to post as an answer. — Inyoka, May 14 '14 at 06:41
FYI once the source code is saved in UTF-8 the following works... **print(word)** — Inyoka, May 14 '14 at 06:42
@eryksun the encoding was **text/x-python; charset=us-ascii** before now it is **text/x-java; charset=us-ascii** ?? according to __file -I__ What you said makes sense, and it looks like it is still ASCII despite what Eclipse told me. However it is working and the code is the same. — Inyoka, May 14 '14 at 07:09
@eryksun Thank you, that makes sense. Don't know why this issue occurred but it definitely doesn't like the source code in a plain 7-BIT ASCII file. — Inyoka, May 14 '14 at 07:28

score 0 · Answer 1 · answered May 15 '14 at 04:51

Thanks to a suggestion by @eryksun I solved my issue by re-encoding the source file from ASCII to UTF-8. The question is different but the solution is here:

http://www.stackoverflow.com/a/542899/792015

Alternatively, if you are using Eclipse, you can paste a non-Roman character (such as the Chinese character 大) into your source code and save the file. If the source is not already UTF-8, Eclipse will offer to change it for you.
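For reference, here is a minimal sketch of the reading side once printing works. It assumes the .csv itself is UTF-8, which the bytes in the question suggest. The `raw` bytes and the `read_rows` helper are made up for illustration and stand in for the real Chinese.csv. The point is that `csv.reader` already yields `str` values, so they can be printed directly without calling `.encode()` or `.decode()`.

```python
import csv
import io

# Hypothetical stand-in for Chinese.csv: the row from the question as UTF-8 bytes.
raw = b'\xe5\xbc\x80\xe5\xbf\x83,K\xc4\x81i x\xc4\xabn,Open heart ,Happy \n'


def read_rows(binary_stream, encoding="utf-8"):
    """Decode a binary stream and return its CSV rows as lists of str."""
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text))


rows = read_rows(io.BytesIO(raw))

for row in rows:
    for word in row:
        # word is already a str (Unicode text), so print it as-is.
        print(word, end=" ")
    print()
```

With a real file, `open('Chinese.csv', encoding='utf-8', newline='')` would take the place of the `io.BytesIO` wrapper. On a terminal that accepts UTF-8, the first field should display as 开心.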

Thank you for all your suggestions and my apologies for answering my own question.

Footnote: If anyone knows why changing the source file type affects the compiled program I would love to know. According to https://docs.python.org/3/tutorial/interpreter.html the interpreter treats source files as UTF-8 by default.
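I can't say for certain what happened in my case, but the 'charmap' error from the comments is raised when the codec attached to the output stream cannot represent a character. A rough diagnostic sketch (the `can_encode` helper is just an illustration, and `reconfigure` needs Python 3.7 or later) is to check which encoding Python has picked for stdout and whether it can encode the text:

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The two characters from the question, written as escapes.
word = "\u5f00\u5fc3"

# The encoding Python believes the output stream accepts.
# Fall back to "utf-8" if the stream does not report one.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
print("stdout encoding:", stdout_encoding)
print("can print the word:", can_encode(word, stdout_encoding))

# Python 3.7+ only: switch stdout to UTF-8 if the current codec cannot cope.
if not can_encode(word, stdout_encoding) and hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

If the reported encoding is something like cp1252 or ascii, `print(word)` will fail no matter how the data was read, which would point at the environment (terminal or IDE console settings) rather than at the .csv.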
