I can reproduce the error in the Python IDLE environment (Python version 3.5.1, Tk version 8.6.4, IDLE version 3.5.1). It seems to be a bug in Tk. However, the original script runs smoothly from a console (Windows cmd, in my case): Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32.
The only workaround I can see may be very slow: the commented script below copies the whole document character by character, replacing every character outside the Basic Multilingual Plane (code point above U+FFFF) with U+FFFD, the replacement character.
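As a rough sketch of the same replacement idea in a more compact form: the function names `replace_non_bmp` and `replace_non_bmp_join` and the sample string are made up for illustration and are not part of the original script. Both variants assume Python 3.3 or later, where a `str` is a sequence of code points, and both avoid building the result with repeated `+=`.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, replacement='\ufffd'):
    """Replace every non-BMP character using one regex pass (hypothetical helper)."""
    return NON_BMP.sub(replacement, text)

def replace_non_bmp_join(text, replacement='\ufffd'):
    """Same result, built with a single join instead of repeated concatenation."""
    return ''.join(c if ord(c) <= 0xFFFF else replacement for c in text)

# U+1F60A is SMILING FACE WITH SMILING EYES; written as an escape so the
# source file itself stays inside the BMP.
sample = '\U0001F60A smiling face \U0001F60A'
print(ascii(replace_non_bmp(sample)))
print(ascii(replace_non_bmp_join(sample)))
```

Whether either variant is noticeably faster than the loop depends on the document size, so it is worth timing on real data before switching.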
Edit: I found this more Pythonic solution (thanks to Mark Ransom), which splits each character above U+FFFF into its two UTF-16 surrogates. Unfortunately, it works in the IDLE shell (pythonw.exe), but the Python console complains:
>>> print( ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack(
... '>2H', c.encode('utf-16be'))) for c in document1)
... )
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\Python35\lib\site-packages\win_unicode_console\streams.py", line 179, in write
    return self.base.write(s)
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>>
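The traceback above can probably be explained without a console at all. The sketch below (the helper name `to_surrogate_pairs` and the sample character are illustrative assumptions) shows that the surrogate-splitting expression produces lone surrogate code points, which a strict `utf-16-le` encoder rejects. The `surrogatepass` error handler is shown only to demonstrate that the two surrogates together are the ordinary UTF-16 encoding of the original character; it is not offered as a fix for `win_unicode_console`.

```python
import struct

def to_surrogate_pairs(text):
    """Replace each non-BMP character by its two UTF-16 surrogate code units."""
    return ''.join(
        c if c <= '\uffff'
        else ''.join(chr(u) for u in struct.unpack('>2H', c.encode('utf-16be')))
        for c in text
    )

sample = '\U0001F60A'                  # SMILING FACE WITH SMILING EYES
pairs = to_surrogate_pairs(sample)
print([hex(ord(c)) for c in pairs])    # high surrogate, then low surrogate

# A strict encoder refuses surrogate code points inside a str.
try:
    pairs.encode('utf-16-le')
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print('strict encoding fails:', exc.reason)

# With 'surrogatepass' the pair encodes to the same bytes as the original character.
raw = pairs.encode('utf-16-le', 'surrogatepass')
print(raw == sample.encode('utf-16-le'))
```

This is consistent with the console stream encoding its output strictly as UTF-16-LE, while the IDLE shell accepts the surrogate pair and hands it to Tk.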
-
# -*- coding: utf-8 -*-
import sys, io
import os, codecs                                            # for debugging
print(os.path.basename(sys.executable), sys.argv[0], '\n')   # for debugging

#######################
### original answer ###
#######################
filepath = 'D:\\test\\reviews1.txt'
with io.open(filepath, mode='r', encoding='utf-8') as myfile:
    document1 = myfile.read()  #.replace('\n', '')

document2 = u''
for character in document1:
    ordchar = ord(character)
    if ordchar <= 0xFFFF:
        # debugging # print( 'U+%.4X' % ordchar, character)
        document2 += character
    else:
        # debugging # print( 'U+%.6X' % ordchar, '�')
        ### �=Replacement Character; codepoint=U+FFFD; utf8=0xEFBFBD
        document2 += '�'

print(document2)      # original answer, runs universally

######################
### updated answer ###
######################
if os.path.basename(sys.executable) == 'pythonw.exe':
    import struct
    document3 = ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in document1)
    print(document3)  # Pythonw shell
else:
    print(document1)  # Python console
Output, Pythonw shell:
================== RESTART: D:/test/Python/Py/q44965129a.py ==================
pythonw.exe D:/test/Python/Py/q44965129a.py
� smiling face with smiling eyes �
� smiling face with open mouth �
� angry face �
smiling face with smiling eyes
smiling face with open mouth
angry face
>>>
Output, Python console:
==> D:\test\Python\Py\q44965129a.py
python.exe D:\test\Python\Py\q44965129a.py
� smiling face with smiling eyes �
� smiling face with open mouth �
� angry face �
smiling face with smiling eyes
smiling face with open mouth
angry face
==>