
I'm reading a file with Python that contains exactly the following line:

à è ì ò ù ç @ \U0001F914

where \U0001F914 is the Unicode escape sequence for an emoji.

If I interpret the string as

string=string.decode('utf-8')

I get:

à è ì ò ù ç @ \U0001F914

while if I interpret it as follows:

string=string.decode('unicode-escape')

I get:

Ã  Ã¨ Ã¬ Ã² Ã¹ Ã§ @ 🤔

How can I print this instead:

à è ì ò ù ç @ 🤔

I'm a beginner, so pardon me if my question is stupid, but I can't figure it out.

Thanks in advance.

Jacquelyn.Marquardt

2 Answers

1

Maybe it is not the best solution, but first you can use encode with 'unicode-escape' instead of decode, and you get

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape')

\xe0 \xe8 \xec \xf2 \xf9 \xe7 @ \\U0001F914

Then you have to replace \\ with \ (in a Python string literal these are written as '\\\\' and '\\'):

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape').replace('\\\\', '\\')

\xe0 \xe8 \xec \xf2 \xf9 \xe7 @ \U0001F914

And then you can apply your decode with 'unicode-escape':

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')

à è ì ò ù ç @ 🤔
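
For readers on Python 3, the same three steps can be sketched as below. This is only an illustration, not the code from the answer: it assumes Python 3, where `str` is already Unicode, so the input is written as a raw string to keep `\U0001F914` as ten literal ASCII characters, and the replace step works on bytes.

```python
# Python 3 sketch (assumption: the answer itself targets Python 2).
# Raw string: the backslash-U sequence stays as literal ASCII text,
# just like the content of the file described in the question.
data = r'à è ì ò ù ç @ \U0001F914'

# Step 1: escape everything. Accented letters become \xNN and the
# existing backslash is doubled.
escaped = data.encode('unicode-escape')

# Step 2: collapse the doubled backslash back to a single one.
unescaped = escaped.replace(b'\\\\', b'\\')

# Step 3: decode, so that both \xNN and \UXXXXXXXX are interpreted.
result = unescaped.decode('unicode-escape')
```

Note that step 2 collapses every doubled backslash, so text that legitimately contains a backslash may come out changed.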

EDIT:

It seems you have to add .decode('utf-8') at the beginning:

#-*- coding: utf-8 -*-

data = 'à è ì ò ù ç @ \U0001F914'.decode('utf-8')

result = data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')

print result  #.encode('utf-8')
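
A different technique that avoids the encode/replace/decode round trip is to decode the bytes as UTF-8 and then expand only the literal `\UXXXXXXXX` sequences with a regular expression. The sketch below assumes Python 3; `expand_long_escapes` and `ESCAPE_RE` are hypothetical names made up for illustration, and `raw` stands in for the bytes read from the file.

```python
import re

# Matches a literal backslash, 'U', and exactly eight hex digits.
ESCAPE_RE = re.compile(r'\\U([0-9a-fA-F]{8})')


def expand_long_escapes(text):
    # Replace each literal \UXXXXXXXX with the character it names.
    # chr() raises ValueError for values above 0x10FFFF.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Stand-in for the bytes read from the file (assumed to be UTF-8).
raw = 'à è ì ò ù ç @ \\U0001F914'.encode('utf-8')

text = raw.decode('utf-8')          # accented letters decoded correctly
result = expand_long_escapes(text)  # escape sequence turned into the emoji
```

Because only the `\U` pattern is touched, accented characters and any other backslashes in the text are left as they are.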
furas
  • I get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:15
  • Which code: the first, second or third? Assign the result to a variable and then print the variable, to see whether the problem is with encode/decode or with print. Your console/terminal may use an unknown encoding, and Python doesn't know how to encode text (when you use `print()`), so it uses the `ascii` encoding. – furas Nov 16 '16 at 20:20
  • my (your) code as it is written in my editor: `# -*- coding: utf-8 -*-` `data = 'à è ì ò ù \U0001F914'` `print data.encode('unicode-escape')` `print data.encode('unicode-escape').replace('\\\\', '\\')` `print data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')` – Jacquelyn.Marquardt Nov 16 '16 at 20:37
  • And I get an error at line 3, where it says `print data.encode('unicode-escape')`. The error is ` File "sticker.py", line 3, in print data.encode('unicode-escape') UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:39
  • You need only `data=` and the last `print`. The problem is a console/terminal which doesn't inform Python what encoding it uses, so `print` uses `.encode('ascii')` by default. You may use `.encode('utf-8')` if your console/terminal uses `utf-8` to display text. If you use Windows then you will probably need `.encode('cp1250')` or similar. – furas Nov 16 '16 at 20:44
  • If I write only `data=` and the last `print` in my source file, I nevertheless get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:50
  • Can you reproduce your solution on your machine? – Jacquelyn.Marquardt Nov 16 '16 at 20:50
  • If I limit myself to `print data` I get `à è ì ò ù \U0001F914` – Jacquelyn.Marquardt Nov 16 '16 at 20:52
  • Take note that the file I'm trying to read does not contain an emoticon when I open it. Instead it contains a \ then a U then a 0...... then a 1 then an F.., which are all ASCII. I want to interpret the sequence \U0001F914 as an emoticon, but when I successfully do that, I can't read the accented characters.... – Jacquelyn.Marquardt Nov 16 '16 at 20:55
  • I run Linux Mint (based on Ubuntu 14.04) with Python 2.7.12, and I don't have this problem when I run `python script.py` in a bash console – furas Nov 16 '16 at 21:00
  • Try `data = 'à è ì ò ù ç @ \U0001F914'.decode('utf-8')` – furas Nov 16 '16 at 21:05
  • THANKS! It works! Accepted as right answer, upvoted and I want to hug you via internet ;). You made my day. Next time I promise I will study the reference instead of copy-pasting without knowing. thanks again. – Jacquelyn.Marquardt Nov 16 '16 at 21:19
  • `decode/encode` always causes problems :) I send you `\U0001F914` - whatever it is :) – furas Nov 16 '16 at 21:26
0

\U0001F914 is outside of the printable range for IDLE, Tk, and most terminals.
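
One way to see why this character is special: its code point is above U+FFFF, so it lies outside the Basic Multilingual Plane, which is presumably the range this answer has in mind. A minimal Python 3 check (the variable names are only illustrative):

```python
import sys

ch = '\U0001F914'                     # in Python 3 this literal is the emoji itself
code_point = ord(ch)
is_outside_bmp = code_point > 0xFFFF  # the BMP ends at U+FFFF

print(hex(code_point), is_outside_bmp)
# Encoding Python uses for stdout; it must be able to represent the character.
print(sys.stdout.encoding)
```

Whether the character actually displays still depends on the terminal and its font.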