Print Unicode string containing both accented characters and emoticons

Question

I'm reading a file with Python that contains exactly the following line:

à è ì ò ù ç @ \U0001F914

where \U0001F914 is the Unicode escape for an emoticon.

If I decode the string as

string=string.decode('utf-8')

I get:

à è ì ò ù ç @ \U0001F914

while if I decode it as follows:

string=string.decode('unicode-escape')

I get:

Ã Ã¨ Ã¬ Ã² Ã¹ Ã§ @

How can I print instead:

à è ì ò ù ç @ 🤔

I'm a beginner, so pardon me if my question is stupid, but I can't figure it out.

Thanks in advance.

What happens if you just `print string`? How about `print repr(string)`? — Mark Ransom, Nov 16 '16 at 20:23
If I `print repr(string)` I get `'\xc3\xa0 \xc3\xa8 \xc3\xac \xc3\xb2 \xc3\xb9 \\U0001F914'` – Jacquelyn.Marquardt, Nov 16 '16 at 20:32
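
The repr output above shows what is actually in the file: each accented letter is a two-byte UTF-8 sequence, while the emoticon is spelled out as ten plain ASCII characters (a backslash, a U, and eight hex digits). The following is a minimal sketch of why neither codec is enough on its own. It assumes Python 3 rather than the Python 2.7 used in this thread, and the variable names are illustrative.

```python
# Bytes as they sit in the file: UTF-8 accents plus a literal backslash-U escape.
raw = b'\xc3\xa0 \xc3\xa8 \xc3\xac \xc3\xb2 \xc3\xb9 \xc3\xa7 @ \\U0001F914'

# utf-8 decodes the accents but leaves the escape as ten literal characters.
as_utf8 = raw.decode('utf-8')

# unicode-escape expands the escape but reads every other byte as Latin-1,
# so each two-byte UTF-8 sequence turns into two wrong characters.
as_escape = raw.decode('unicode-escape')

print(as_utf8)
print(as_escape)
```

The two printed lines should correspond to the two results shown in the question.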

furas · Accepted Answer · 2016-11-16T21:10:26.360

1

Maybe it is not the best solution, but first you can use encode with 'unicode-escape' instead of decode, which gives you

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape')

\xe0 \xe8 \xec \xf2 \xf9 \xe7 @ \\U0001F914

Then you have to replace the doubled backslash \\ with a single \ (in Python source these are written as '\\\\' and '\\'):

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape').replace('\\\\', '\\')

\xe0 \xe8 \xec \xf2 \xf9 \xe7 @ \U0001F914

and then you can apply your decode with 'unicode-escape':

data = 'à è ì ò ù ç @ \U0001F914'
print data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')

à è ì ò ù ç @ 🤔
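
The three steps can be traced one value at a time. This sketch assumes Python 3, where the starting point is a str already decoded from UTF-8 (in Python 2.7 that corresponds to a unicode object). The names text, escaped, unescaped and result are illustrative.

```python
# Text as read from the file: real accented letters, emoticon still spelled out.
text = 'à è ì ò ù ç @ \\U0001F914'

# Step 1: every non-ASCII character becomes an escape; the backslash is doubled.
escaped = text.encode('unicode-escape')

# Step 2: undo the doubling so the emoticon reads as a single escape again.
unescaped = escaped.replace(b'\\\\', b'\\')

# Step 3: decode all escapes at once, accents and emoticon alike.
result = unescaped.decode('unicode-escape')

print(escaped)
print(unescaped)
print(result)
```

One caveat: the replace step treats every backslash in the text as the start of an escape, so a file that contains literal backslashes for other reasons would be altered too.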

EDIT:

It seems you have to add .decode('utf-8') at the beginning, so that data is a unicode object rather than a byte string before encode is called:

#-*- coding: utf-8 -*-

data = 'à è ì ò ù ç @ \U0001F914'.decode('utf-8')

result = data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')

print result  #.encode('utf-8')
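
A possible alternative, which is not the method of this answer: decode the bytes as UTF-8 first and then expand only the \UXXXXXXXX sequences with a regular expression, leaving any other backslashes alone. This is a sketch under the assumption of Python 3; LONG_ESCAPE and expand_long_escapes are hypothetical names, not part of any library.

```python
import re

# Matches only the eight-hex-digit form; other backslashes are left untouched.
LONG_ESCAPE = re.compile(r'\\U([0-9A-Fa-f]{8})')

def expand_long_escapes(text):
    """Replace each literal backslash-U escape (eight hex digits) with its character."""
    return LONG_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Same bytes as in the question: UTF-8 accents plus a spelled-out escape.
raw = b'\xc3\xa0 \xc3\xa8 \xc3\xac \xc3\xb2 \xc3\xb9 \xc3\xa7 @ \\U0001F914'
result = expand_long_escapes(raw.decode('utf-8'))
print(result)
```

Note that chr raises ValueError for values above 0x10FFFF, so malformed escapes in the input would need separate handling.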

edited Nov 16 '16 at 21:10

answered Nov 16 '16 at 19:58

furas


I get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:15
Which code: first, second or third? Assign the result to a variable and then print the variable to see whether the problem is with encode/decode or with print. Your console/terminal may use an unknown encoding, and Python doesn't know how to encode the text (when you use `print()`), so it uses the `ascii` encoding. – furas Nov 16 '16 at 20:20
my (your) code as it is written in my editor: `# -*- coding: utf-8 -*-` `data = 'à è ì ò ù \U0001F914'` `print data.encode('unicode-escape')` `print data.encode('unicode-escape').replace('\\\\', '\\')` `print data.encode('unicode-escape').replace('\\\\', '\\').decode('unicode-escape')` – Jacquelyn.Marquardt Nov 16 '16 at 20:37
And I get an error at line 3, where it says `print data.encode('unicode-escape')`. The error is ` File "sticker.py", line 3, in <module> print data.encode('unicode-escape') UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:39
You need only `data=` and the last `print`. The problem is the console/terminal, which doesn't inform Python what encoding it uses, so `print` uses `.encode('ascii')` by default. You may use `.encode('utf-8')` if your console/terminal uses `utf-8` to display text. If you use Windows then you will probably need `.encode('cp1250')` or similar. – furas Nov 16 '16 at 20:44
If I write only data= and the last print in my source file, I nevertheless get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)` – Jacquelyn.Marquardt Nov 16 '16 at 20:50
Can you reproduce your solution on your machine? – Jacquelyn.Marquardt Nov 16 '16 at 20:50
If I limit myself to print data I get `à è ì ò ù \U0001F914` – Jacquelyn.Marquardt Nov 16 '16 at 20:52
Take note that the file I'm trying to read does not contain an emoticon when I open it. Instead it contains a \ then a U then a 0 ... then a 1 then an F ..., which are ASCII. I want to interpret the sequence \U0001F914 as an emoticon, but when I successfully do it, I can't read the accented characters. – Jacquelyn.Marquardt Nov 16 '16 at 20:55
I run Linux Mint (based on Ubuntu 14.04) with Python 2.7.12, and I don't have this problem when I run `python script.py` in the bash console – furas Nov 16 '16 at 21:00
Try `data = 'à è ì ò ù ç @ \U0001F914'.decode('utf-8')` – furas Nov 16 '16 at 21:05
THANKS! It works! Accepted as right answer, upvoted and I want to hug you via internet ;). You made my day. Next time I promise I will study the reference instead of copy-pasting without knowing. thanks again. – Jacquelyn.Marquardt Nov 16 '16 at 21:19
`decode/encode` always causes problems :) I send you `\U0001F914` - whatever it is :) – furas Nov 16 '16 at 21:26

score 0 · Answer 2 · answered Nov 16 '16 at 19:57

0

\U0001F914 is outside the Basic Multilingual Plane (above U+FFFF), the range that IDLE, Tk, and most terminals can print.

answered Nov 16 '16 at 19:57

Nicholas Jones


My terminal (bash) correctly interprets \U0001F914 as an emoticon and displays it correctly on screen. – Jacquelyn.Marquardt Nov 16 '16 at 20:16
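
Whether a character falls in the range this answer refers to can be checked from its code point. A minimal sketch, assuming Python 3 and assuming that the relevant limit is the Basic Multilingual Plane (U+0000 to U+FFFF); outside_bmp is a hypothetical helper.

```python
def outside_bmp(ch):
    """Return True if the single character lies above U+FFFF."""
    return ord(ch) > 0xFFFF

print(outside_bmp('\U0001F914'))  # the emoticon from the question
print(outside_bmp('\u00e0'))      # a with grave accent
```

A result of True only says the character needs more than 16 bits; whether it is displayed still depends on the terminal and its font, as the comment above illustrates.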
