Encoding a unicode string to utf-8 and getting question marks

Question

System:

Python 2.7.5, IPython 2.3.1, OS X terminal (local), sys.stdout.encoding: 'UTF-8'

(venv) toz$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I expected both commands to print the same output, but the latter prints question marks for every character with ord(c) > 127. Why is that? How can I encode this unicode string and iterate through it without getting the question marks?

In [77]: for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“': print c.encode('utf-8'),
! " # % ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ’ ” “

In [78]: for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“'.encode('utf-8'): print c,
! " # % ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ? ? ? ? ? ? ? ? ?

Let's print the ord values:

In [92]: for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“': print c.encode('utf-8'),ord(c),
! 33 " 34 # 35 % 37 ' 39 ( 40 ) 41 * 42 + 43 , 44 - 45 . 46 / 47 : 58 ; 59 < 60 = 61 > 62 ? 63 @ 64 [ 91 \ 92 ] 93 ^ 94 _ 95 ` 96 { 123 | 124 } 125 ~ 126 ’ 8217 ” 8221 “ 8220

In [93]: for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“': print c.encode('utf-8'),ord(c.encode('utf-8')),
! 33 " 34 # 35 % 37 ' 39 ( 40 ) 41 * 42 + 43 , 44 - 45 . 46 / 47 : 58 ; 59 < 60 = 61 > 62 ? 63 @ 64 [ 91 \ 92 ] 93 ^ 94 _ 95 ` 96 { 123 | 124 } 125 ~ 126 ’
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-93-1eb3985b825b> in <module>()
----> 1 for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“': print c.encode('utf-8'),ord(c.encode('utf-8')),

TypeError: ord() expected a character, but string of length 3 found

The length of three matches the three bytes that each of the three non-ASCII characters needs in UTF-8, which also accounts for the nine question marks.

Start `ipython notebook` to get the web-interface and avoid Unicode + Windows console issues. Print Unicode (remove `.encode('utf-8')`) in this case. — jfs, Jan 20 '15 at 13:07
If you are not on Windows, just remove `.encode('utf-8')` to fix the output (print Unicode directly). — jfs, Jan 20 '15 at 13:48
I'm on OSX and got the same output (i.e. question marks) from ipython notebook as well. — tozCSS, Jan 20 '15 at 18:10
have you dropped `.encode('utf-8')` as I said? Add `assert type(c) == unicode` — jfs, Jan 20 '15 at 18:12

Mark Ransom · Accepted Answer · 2015-01-21T17:09:00.267

2

Encoding a string converts its Unicode characters to bytes. In UTF-8, each character becomes a variable number of bytes (between one and four). It appears your console prints any bytes outside of the ASCII range of 0-127 as a question mark.

>>> for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“': print len(c.encode('utf-8')),

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3


>>> for c in u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~’”“'.encode('utf-8'): print c,ord(c),

! 33 " 34 # 35 % 37 ' 39 ( 40 ) 41 * 42 + 43 , 44 - 45 . 46 / 47 : 58 ; 59 < 60 = 61 > 62 ? 63 @ 64 [ 91 \ 92 ] 93 ^ 94 _ 95 ` 96 { 123 | 124 } 125 ~ 126 â 226 ﾀ 128 ﾙ 153 â 226 ﾀ 128 ﾝ 157 â 226 ﾀ 128 ﾜ 156

The reason you get question marks is that you've broken up the UTF-8 sequences, and each byte of the sequence isn't a valid character by itself. Those invalid characters are displayed as question marks.

As you can see, my console (Python 2.7.5's IDLE) doesn't print question marks but substitutes incorrect characters instead.
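To make the difference concrete, here is a minimal sketch that compares the two loops from the question without depending on how any particular console renders the bytes. The helper `decodes_alone` and the variable names are my own, introduced for illustration only, and the string is shortened to a tilde plus the three curly quotes. It uses `bytearray` so the byte values come out as integers on both Python 2 and Python 3.

```python
# Tilde plus the three curly quotes from the question, written as escapes
# so the source file encoding does not matter.
text = u'~\u2019\u201d\u201c'

# Loop 1: iterate over characters, encode each one.
# Every piece is a complete UTF-8 sequence.
per_char = [c.encode('utf-8') for c in text]
lengths = [len(b) for b in per_char]

# Loop 2: encode the whole string, then iterate over the result.
# bytearray yields integer byte values on both Python 2 and 3.
encoded = text.encode('utf-8')
byte_values = list(bytearray(encoded))


def decodes_alone(byte_value):
    """Return True if this single byte is valid UTF-8 on its own."""
    try:
        bytearray([byte_value]).decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


standalone = [decodes_alone(b) for b in byte_values]

print(lengths)      # [1, 3, 3, 3]
print(byte_values)  # [126, 226, 128, 153, 226, 128, 157, 226, 128, 156]
print(standalone)   # only the first byte (the tilde) decodes by itself
```

The nine bytes that fail to decode alone are the nine question marks in the question. Iterating over the characters first, or printing the unicode string directly, keeps each three-byte sequence together.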

edited Jan 21 '15 at 17:09

answered Jan 19 '15 at 19:31

Mark Ransom


It should print `3` instead of `2`. I've copy-pasted the characters from your answer, `u'’”“'`; they are `u'\u2019\u201d\u201c'`. Any other value is an issue with your environment. [U+0092](http://codepoints.net/U+0092), etc. are control characters unrelated to the quotes. – jfs Jan 20 '15 at 13:16
@J.F.Sebastian I get the same result on two different PCs, both running Windows 7, copying from Chrome and pasting into the Python 2.7 IDLE console. – Mark Ransom Jan 20 '15 at 13:20
You can check whether it is the Windows clipboard that corrupts the value (unlikely) or IDLE while sending it to a child process: `print repr(u'\u2019\u201d\u201c'), u'\u2019\u201d\u201c'`. If IDLE on Python 2 supports it, try starting it in "same/single process" mode. – jfs Jan 20 '15 at 13:26
`u'\u2019'.encode('cp1252').decode('latin1') == u'\x92'` – jfs Jan 20 '15 at 13:37
@J.F.Sebastian the problem is with Idle, if I read the characters directly from the clipboard I get the right values. – Mark Ransom Jan 20 '15 at 14:17
If it is reproducible on Python 2.7.9, could you [submit a bug report](http://bugs.python.org/)? – jfs Jan 20 '15 at 14:30
@J.F.Sebastian it's already fixed in Python 3.2. I've had other problems with Unicode input in Idle 2.7, I've just about given up. – Mark Ransom Jan 20 '15 at 14:37
As far as I know, Python 2.7 bugs could be fixed if you report it. – jfs Jan 20 '15 at 14:39
"It appears your console prints any bytes outside of the ASCII range of 0-127 as a question mark." Yes, it is obvious but I still don't know why and how to fix? Every setting appears to be UTF-8 as I commented below @bav's answer. – tozCSS Jan 20 '15 at 18:21
@oztalha I've updated my answer with the final piece of the puzzle. The way to fix it is to stop trying to print out each of the UTF-8 bytes as if it were a character, because it isn't. – Mark Ransom Jan 20 '15 at 18:54
@MarkRansom thanks. Now I know why I got it also thanks to this [SO answer](http://stackoverflow.com/a/8609051/1054154). – tozCSS Jan 20 '15 at 18:55
@J.F.Sebastian it has been reported multiple times, but nobody can convince the devs that it's a real or solvable issue. See e.g. http://bugs.python.org/issue17348 and http://bugs.python.org/issue15809. – Mark Ransom Jan 21 '15 at 00:03
@MarkRansom: yes. the issues are the same: clipboard correctly passes `u'\u2019'` character, IDLE encodes it to cp1252 (Python source encoding on your machine in IDLE -- it should be the same as Windows codepage) as if you've typed it manually, something (`compile()` function?) (I don't understand why [pep-0263](https://www.python.org/dev/peps/pep-0263/) is mentioned) uses latin1 encoding. Do you see `ord(u'€') == 128` on your machine? What value do you get `ord(u'\u20ac')`? I think it is worth to leave a message on http://bugs.python.org/issue15809 (the issue is in "patch review" stage). – jfs Jan 21 '15 at 00:36
@J.F.Sebastian I finally got motivated to find my own answer to the problem. See http://stackoverflow.com/a/28060419/5987 – Mark Ransom Jan 21 '15 at 05:47

score 0 · Answer 2 · answered Jan 19 '15 at 19:16

0

It seems you have the "C" locale. You can check it with the `locale` shell command, or you can run IPython with `LANG=en_US.UTF-8 ipython` (for bash/sh/etc.).
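The same check can be made from inside Python, which is useful when the shell and the interpreter might disagree. This is only a rough sketch: `describe_encodings` is a hypothetical helper name, and the values it reports depend entirely on your environment, so no particular output is implied.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings this Python process reports for text I/O."""
    return {
        # May be None when stdout is not a terminal (e.g. piped output).
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding derived from the user's locale settings.
        'preferred': locale.getpreferredencoding(),
        # Encoding used for file names.
        'filesystem': sys.getfilesystemencoding(),
    }


info = describe_encodings()
for name in sorted(info):
    print('%s: %s' % (name, info[name]))
```

If all three report UTF-8, as they do for the asker, the locale is not the cause of the question marks.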

answered Jan 19 '15 at 19:16

bav


It appears not to be related to locale: (venv) toz$ locale LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL= – tozCSS Jan 19 '15 at 19:53
I guess this is an ssh session. Then the problem is in your terminal settings. Do you use PuTTY? – bav Jan 19 '15 at 19:59
Do you use Terminal.app? Try iTerm2. It's very interesting issue. – bav Jan 19 '15 at 20:02
Yes, regular terminal. I got the same result in notebook as well: [screenshot](http://awesomescreenshot.com/037484q935) – tozCSS Jan 19 '15 at 20:16
Last guess. What value does sys.stdout.encoding have? – bav Jan 19 '15 at 20:37
In [2]: sys.stdout.encoding Out[2]: 'UTF-8' – tozCSS Jan 19 '15 at 21:01
I'm a dumb-dumb. I thought that the first expression also output question marks somehow. @MarkRansom gave a correct answer. Your terminal font has no latin symbols with umlauts. – bav Jan 20 '15 at 06:57
