
I'm trying to read a text file, but it throws an error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 12416-12416: Non-BMP character not supported in Tk

I've also tried to ignore the error, but that is not working. Here is the code:

import io

with io.open('reviews1.txt', mode='r', encoding='utf-8') as myfile:
    document1 = myfile.read().replace('\n', '')
print(document1)
  • Try [`surrogateescape` error handler](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#unicode-error-handlers)? Nevertheless, please [edit] your question and show full traceback. – JosefZ Jul 07 '17 at 08:39
  • The problem is not with reading the file (that would be a **de**coding error). It's with the `print` expression: your environment is apparently unable to process characters beyond the [BMP](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane), such as emoticons. Is writing to a file instead an option? – lenz Jul 07 '17 at 09:08
  • I can reproduce the error in Python 3.5 IDLE environment. However, the script runs smoothly from a console (Windows `cmd`, in my case). @lenz is right, the error is related to `print`. – JosefZ Jul 07 '17 at 11:23
  • Yes, but how do I overcome it? @JosefZ – Maitreya Patel Jul 07 '17 at 13:33
  • I saved the data to that file first, and now I am trying to read it back and print it. Is there any way to delete those characters while writing or reading? @lenz – Maitreya Patel Jul 07 '17 at 13:35

2 Answers


The problem is not with reading the file (that would be a decoding error). It's with the print call: your environment is apparently unable to process characters beyond the BMP (Basic Multilingual Plane), such as emoji.

If you want to print those characters to STDOUT, check whether your shell/IDE can use an encoding that covers all of Unicode (UTF-8, UTF-16...). Alternatively, switch to a different environment for running the script.
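As a rough way to check this, the sketch below tests whether a given encoding can represent a non-BMP character. The helper name `can_encode` and the sample string are made up for illustration. Note that in IDLE the reported `sys.stdout.encoding` may not reflect the Tk limitation, so treat the result as a hint only.

```python
import sys

def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # LookupError covers codec names that Python does not know.
        return False
    return True

sample = 'smile \U0001F600'  # contains one character outside the BMP
# Fall back to ASCII if the stream has no usable encoding attribute.
stdout_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(stdout_encoding, can_encode(sample, stdout_encoding))
```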

If you want to run it in the same setting, you can encode the data yourself, which lets you specify custom error handling:

import sys

data = document1.encode('UCS-2', errors='replace')
sys.stdout.buffer.write(data)

This will replace unsupported characters with ? or some other substitute character. You can also specify errors='ignore', which drops those characters entirely.

I couldn't test this, though, because my codecs library doesn't know the UCS-2 encoding. It's an obsolete encoding that Windows NT used before Windows 2000 moved to UTF-16.
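Since Python's standard codecs do not appear to include `UCS-2`, one alternative is to filter the string itself before printing. This is a minimal sketch, not a tested fix for every environment; the helper name `strip_non_bmp` is made up for illustration.

```python
import re

# Matches every code point above U+FFFF, i.e. outside the BMP.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def strip_non_bmp(text, replacement='\ufffd'):
    """Replace each non-BMP character with `replacement` (U+FFFD by default)."""
    return _NON_BMP.sub(replacement, text)

sample = 'good \U0001F600 bad \U0001F620'
print(strip_non_bmp(sample))      # non-BMP characters become U+FFFD
print(strip_non_bmp(sample, ''))  # non-BMP characters are dropped
```

Passing an empty string as `replacement` mirrors `errors='ignore'`, while the default mirrors `errors='replace'`.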

lenz

I can reproduce the error in the Python IDLE environment (Python version 3.5.1, Tk version 8.6.4, IDLE version 3.5.1). It seems to be a bug in Tk. However, the original script runs smoothly from a console (Windows cmd, in my case): Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32.

The only workaround I can see could be very slow: the commented script below copies the whole document character by character, replacing every character outside the Basic Multilingual Plane with the replacement character U+FFFD.

Edit: I found this (more Pythonic) solution (thanks to Mark Ransom). Unfortunately, it runs in the Python (IDLE) shell, but the Python console complains:

>>> print( ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack(
...   '>2H', c.encode('utf-16be'))) for c in document1)
... )
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\Python35\lib\site-packages\win_unicode_console\streams.py", line 179, in write
    return self.base.write(s)
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>>
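
The `struct.unpack('>2H', ...)` expression above splits each non-BMP character into its two UTF-16 surrogate code units. The sketch below does the same arithmetic by hand; the helper name `to_surrogate_pair` is made up for illustration.

```python
import struct

def to_surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogate code units for a non-BMP character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        raise ValueError('character is already inside the BMP')
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low

ch = '\U0001F600'
pair = to_surrogate_pair(ch)
# The hand-computed pair matches the big-endian UTF-16 encoding of the character.
assert pair == struct.unpack('>2H', ch.encode('utf-16be'))
print([hex(unit) for unit in pair])
```

A string built from such surrogates with `chr()` holds lone surrogate code points, which a strict UTF-16 codec refuses to encode. That is consistent with the "surrogates not allowed" error in the console traceback.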

-

# -*- coding: utf-8 -*-

import sys, io
import os, codecs                       # for debugging

print(os.path.basename(sys.executable), sys.argv[0], '\n') # for debugging

#######################
### original answer ###
#######################
filepath = 'D:\\test\\reviews1.txt'
with io.open(filepath, mode='r',encoding='utf-8') as myfile:
    document1=myfile.read() #.replace('\n', '')
    document2=u''
    for character in document1:
        ordchar = ord(character)
        if ordchar <= 0xFFFF:
            # debugging # print( 'U+%.4X' % ordchar, character)
            document2+=character
        else:
            # debugging # print( 'U+%.6X' % ordchar, '�')
            ###         �=Replacement Character; codepoint=U+FFFD; utf8=0xEFBFBD
            document2+='�'
print(document2)                        # original answer, runs universally

######################
### updated answer ###
######################
if os.path.basename(sys.executable) == 'pythonw.exe':    
    import struct
    document3 = ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in document1)
    print(document3)                    # Pythonw shell
else:
    print(document1)                    # Python console

Output, Pythonw shell:

================== RESTART: D:/test/Python/Py/q44965129a.py ==================
pythonw.exe D:/test/Python/Py/q44965129a.py 

� smiling face with smiling eyes �
� smiling face with open mouth   �
� angry face                     �

 smiling face with smiling eyes 
 smiling face with open mouth   
 angry face                     

>>>

Output, Python console:

==> D:\test\Python\Py\q44965129a.py
python.exe D:\test\Python\Py\q44965129a.py

� smiling face with smiling eyes �
� smiling face with open mouth   �
� angry face                     �

 smiling face with smiling eyes 
 smiling face with open mouth   
 angry face                     

==>
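
The `pythonw.exe` check in the script is specific to Windows. A possibly more portable heuristic is to look for IDLE's modules in `sys.modules`; this is an assumption about how IDLE runs user code, not a documented API. The names `running_in_idle`, `to_bmp` and `safe_print` are made up for illustration.

```python
import sys

def running_in_idle():
    """Heuristic: IDLE imports its own `idlelib` package into the running process."""
    return 'idlelib' in sys.modules

def to_bmp(text, replacement='\ufffd'):
    """Replace characters above U+FFFF so that Tk-based shells can display the text."""
    return ''.join(c if ord(c) <= 0xFFFF else replacement for c in text)

def safe_print(text):
    """Print `text`, filtering non-BMP characters only when running under IDLE."""
    print(to_bmp(text) if running_in_idle() else text)

print(to_bmp('angry face \U0001F620'))
```

Calling `safe_print(document1)` in place of `print(document1)` would then keep the full text on a console and fall back to the filtered text in IDLE.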
JosefZ
  • Why use `io.open` when built-in `open` does the same? Why iterating character-wise with manual checking, when `str.encode` provides the `errors='replace'` schema, which does just that for you? Don't reinvent the wheel... – lenz Jul 08 '17 at 13:58
  • @lenz **1**. I _know_ that [`io.open` is an alias for the builtin open() function](https://docs.python.org/3.5/library/io.html). It's the OP's design… **2**. Did you try to apply **any** error handlers? Apparently, you didn't: you _couldn't test this, though_ ([sic](https://en.wikipedia.org/wiki/Sic)!) – JosefZ Jul 09 '17 at 07:30
  • Oh, I didn't see you took `io.open` from the OP – sorry for blaming you for that. About testing: yes, it's a pity. According to the OP's error message, there must be a Python version/implementation that has a `UCS-2` codec; maybe it's available on Windows only. Were you able to run my suggestion? – lenz Jul 09 '17 at 18:35