
I'm working from a .csv file produced by OpenOffice that contains a mix of Roman and Chinese characters. This is an example of one row:

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'b'Open heart 'b'Happy '

The following part of the row contains two Chinese characters stored as bytes. I would like them displayed as Chinese characters on the command line by a very basic Python 3 program (see bottom). How do I do this?

b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'
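
As a minimal sketch of what these bytes hold, assuming they are UTF-8 (which the byte patterns suggest), decoding them yields two Chinese characters and their pinyin. The variable names here are only illustrative:

```python
# The two byte strings from the example row, copied verbatim.
hanzi_bytes = b'\xe5\xbc\x80\xe5\xbf\x83'
pinyin_bytes = b'K\xc4\x81i x\xc4\xabn'

# bytes.decode() turns raw bytes into a str (Unicode text).
hanzi = hanzi_bytes.decode('utf-8')
pinyin = pinyin_bytes.decode('utf-8')

# Six bytes become two characters: each of these characters
# takes three bytes in UTF-8.
print(len(hanzi_bytes), len(hanzi))  # 6 2

# Encoding the text again gives back the original bytes.
print(hanzi.encode('utf-8') == hanzi_bytes)  # True
```

On a terminal that can display UTF-8, `print(hanzi, pinyin)` should then show the characters themselves rather than the `b'...'` form.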

When I open the .csv in OpenOffice I need to select "Chinese Simplified (EUC-CN)" as the character set, if that helps. I have searched extensively, but I do not understand Unicode and the pages I found do not make sense to me.

import csv
f = open('Chinese.csv', encoding="utf-8") 
file = csv.reader(f)

for line in file:
    for word in line:
        print(word.encode('utf-8'), end='')
    print("\n")
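
For comparison, here is a hedged sketch of the same loop without the `.encode('utf-8')` call, since `csv.reader` already yields decoded `str` values when the file is opened in text mode. The helper names `read_rows` and `format_row` are hypothetical, and a comma-delimited file is assumed:

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Return every CSV row as a list of already-decoded str values."""
    # newline="" is the setting the csv module documentation recommends
    # for files handed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        return [row for row in csv.reader(f)]


def format_row(row):
    """Join the fields of one row with tabs, dropping stray spaces."""
    return "\t".join(word.strip() for word in row)
```

Usage would look like `for row in read_rows('Chinese.csv'): print(format_row(row))`. Whether the characters actually appear still depends on the output encoding of the terminal.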

Thank you in advance for any suggestions.

Inyoka
  • yes, I tried .decode but didn't have any luck. It gave an error about not having a decode method on the 'word' then another saying 'codec can't decode byte 0xce in position 1'. So your output shows Chinese characters? – Inyoka May 14 '14 at 05:06
  • "AttributeError: 'str' object has no attribute 'decode'". FYI I have added the Chinese character sets to the terminal (I'm using a Mac). – Inyoka May 14 '14 at 05:14
  • well, yes, if you already have `str` instances then they're already decoded into unicode, no need to decode them further obviously. Silly question: what does `print(word)` do? – roippi May 14 '14 at 05:18
  • print(word) gives "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>" – Inyoka May 14 '14 at 05:29
  • That is an issue with your terminal not telling Python that it can print `utf-8` characters. See http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python – roippi May 14 '14 at 05:34
  • @eryksun >>> sys.stdout.encoding gives ... 'UTF-8' – Inyoka May 14 '14 at 06:15
  • @roippi http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python refers to Python2.* – Inyoka May 14 '14 at 06:16
  • @eryksun thank you, it wasn't the terminal, it was the source code file. The question is different but the solution is here : http://stackoverflow.com/a/542899/792015 I read tens of articles and nowhere does it say the source needs saving in UTF-8 but it makes sense. I don't have enough reputation to post as an answer. – Inyoka May 14 '14 at 06:41
  • FYI once the source code is saved in UTF-8 the following works... **print(word)** – Inyoka May 14 '14 at 06:42
  • @eryksun the encoding was **text/x-python; charset=us-ascii** before, now it is **text/x-java; charset=us-ascii** ?? according to `file -I`. What you said makes sense, and it looks like it is still ASCII despite what Eclipse told me. However it is working and the code is the same. – Inyoka May 14 '14 at 07:09
  • @eryksun Thank you, that makes sense. Don't know why this issue occurred but it definitely doesn't like the source code in a plain 7-BIT ASCII file. – Inyoka May 14 '14 at 07:28
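
The two error messages quoted in the comments can be reproduced without a terminal. This is only a sketch: `can_encode` is a hypothetical helper, and `cp1252` is an assumed stand-in for whichever table-based ("charmap") codec was in use at the time:

```python
# The two characters from the sample row, built from their UTF-8 bytes.
text = b'\xe5\xbc\x80\xe5\xbf\x83'.decode('utf-8')


def can_encode(s, encoding):
    """Return True if every character of s exists in the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# In Python 3 a str is already decoded, so it has no .decode() method.
print(hasattr(text, 'decode'))      # False

# print() must encode text for the output stream; that only works
# when the stream's codec contains the characters being printed.
print(can_encode(text, 'utf-8'))    # True
print(can_encode(text, 'cp1252'))   # False
print(can_encode(text, 'ascii'))    # False
```

`sys.stdout.encoding`, mentioned above, reports which codec `print()` will use for the current output stream.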

1 Answer


Thanks to a suggestion by @eryksun, I solved my issue by re-encoding the source file from ASCII to UTF-8. The linked question is different, but the solution is here:

http://www.stackoverflow.com/a/542899/792015

Alternatively, if you are using Eclipse, you can paste a non-Roman character (such as a Chinese character) into your source code and save the file. If the source is not already UTF-8, Eclipse will offer to change it for you.

Thank you for all your suggestions and my apologies for answering my own question.

Footnote: If anyone knows why changing the source file's encoding affects the running program, I would love to know. According to https://docs.python.org/3/tutorial/interpreter.html the interpreter treats source files as UTF-8 by default.
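
For anyone investigating the footnote, the encoding Python assumes for a source file can be inspected with the standard library's `tokenize.detect_encoding`. The sketch below (with a hypothetical helper, `source_encoding`) does not by itself explain the behaviour described above. It only shows that a file with no BOM and no coding declaration is treated as UTF-8, and that pure ASCII bytes are also valid UTF-8:

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding the tokenizer assumes for these source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines = tokenize.detect_encoding(readline)
    return encoding


plain = b"print('hi')\n"                   # no BOM, no coding declaration
with_bom = b"\xef\xbb\xbfprint('hi')\n"    # starts with a UTF-8 BOM

print(source_encoding(plain))      # utf-8
print(source_encoding(with_bom))   # utf-8-sig

# Pure ASCII bytes decode identically as ASCII and as UTF-8.
print(plain.decode('ascii') == plain.decode('utf-8'))  # True
```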

Inyoka