
I need to process some Excel files that contain lots of "−" ('\u2212') as well as other characters. After a lot of trying, I still can't print this character to the screen or save it to a file:

a = '−'
print(a.encode('utf-8'))  # prints b'\xe2\x88\x92'
print(a)  # raises UnicodeEncodeError: 'gbk' codec can't encode character '\u2212' in position 0: illegal multibyte sequence
with open('test.txt', 'w') as file:
    file.write(a)  # raises UnicodeEncodeError: 'gbk' codec can't encode character '\u2212' in position 0: illegal multibyte sequence

On this page, https://docs.python.org/3.4/howto/unicode.html, the examples replace such characters with something else, but I need to print the character itself, or at least write it to a file properly:

>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')  
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
  position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'

How can I do it?

  • Start with a console that can actually handle a wider range of Unicode characters. Your console is configured for GBK only. – Martijn Pieters Jul 29 '15 at 20:57
  • To write to a file, specify a codec that can handle the specific Unicode codepoints. `open('test.txt', 'w', encoding='utf8')` for example. – Martijn Pieters Jul 29 '15 at 20:57
  • @MartijnPieters I did use `-m idlelib -r` in the interpreter options to pop up an IDLE shell, which temporarily fixes this problem, but is it possible to print it in the PyCharm console? – liyuanhe211 Jul 29 '15 at 21:39
  • See https://www.jetbrains.com/pycharm/help/configuring-output-encoding.html for the PyCharm console configuration. – Martijn Pieters Jul 30 '15 at 07:10
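
The file-writing suggestion from the comments can be sketched roughly as below. This is a minimal illustration, not a definitive fix: the file name `test_minus.txt` and the use of the temp directory are assumptions made for the example, and the `ascii()` call is only one possible fallback for a console that cannot render the character.

```python
import os
import tempfile

minus = '\u2212'  # the MINUS SIGN character from the question

# GBK has no mapping for U+2212, which is why the default codec failed.
try:
    minus.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Hypothetical output path, chosen only for this example.
path = os.path.join(tempfile.gettempdir(), 'test_minus.txt')

# Passing encoding explicitly avoids relying on the locale default (GBK here).
with open(path, 'w', encoding='utf-8') as f:
    f.write(minus)

# Read the file back with the same codec to confirm the round trip.
with open(path, encoding='utf-8') as f:
    round_trip = f.read()

# ascii() yields an escaped form that any console encoding can display.
print(ascii(minus))
```

Printing the character itself still depends on the console encoding, so the PyCharm setting linked in the last comment remains the relevant place to look for that part.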
