I don't understand what Python 3.3 on Windows 8 is doing with Unicode. The file t.txt contains three bytes, hex values e2, 80, 94, which is the UTF-8 representation of an em dash. I would expect the following code to display that character; I don't understand why it does not, or why cp850.py is involved. Can anyone explain what is going on, and what I need to do to read Unicode from a text file? I'm too confused to ask a clearer question.

>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:06:53) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open( 't.txt', encoding='utf-8' )
>>> s = f.readline()
>>> print(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python33\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 0: character maps to <undefined>
>>>
>>> import sys
>>> sys.getfilesystemencoding()
'mbcs'
>>> sys.getdefaultencoding()
'utf-8'
user184411
  • `print(s.decode("utf-8")) ` – mishik Apr 07 '15 at 04:16
  • No, it has already been decoded. If anything, you want to encode for output. – tripleee Apr 07 '15 at 04:25
  • There's no reason to use `codecs.getwriter` in Python 3. Just create a new `io.TextIOWrapper` instance, and remember to detach the buffer from the original wrapper, or you may be in for a surprise when it gets deallocated and closes the stream out from under you. – Eryk Sun Apr 07 '15 at 10:08
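
The last comment can be illustrated with a minimal sketch. It assumes the goal is to keep writing to the same underlying byte stream but with a different text encoding. The helper name `rewrap` and the in-memory stand-in for the console are hypothetical, chosen only for illustration; they are not from the question or the comments.

```python
import io


def rewrap(stream, encoding="utf-8", errors="strict"):
    """Return a new text wrapper over the binary buffer of `stream`.

    `rewrap` is a hypothetical helper, not part of the standard library.
    """
    stream.flush()
    # detach() separates the binary buffer from the old wrapper, so the old
    # wrapper can be deallocated without closing the underlying stream.
    raw = stream.detach()
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                            line_buffering=True)


# Stand-in for a console whose text layer uses cp850 (an assumption made so
# the sketch runs anywhere, without touching the real sys.stdout).
console = io.TextIOWrapper(io.BytesIO(), encoding="cp850")

try:
    console.write("\u2014")  # cp850 has no mapping for the em dash
except UnicodeEncodeError as exc:
    print("cp850 wrapper failed:", exc.reason)

console = rewrap(console)    # same buffer, now encoded as UTF-8
console.write("\u2014")
console.flush()
print(console.buffer.getvalue())

# For the real console the equivalent call would be:
#     import sys
#     sys.stdout = rewrap(sys.stdout)
```

This only changes how Python encodes the text it writes. Whether the console then renders those bytes as an em dash likely still depends on the console's code page and font. Passing `errors="replace"` together with the original encoding is another option if a substitute character is acceptable.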

0 Answers