This simple program, run on OS X 10.6.8 with Python 3.4 in Terminal.app using the Menlo font, should print three Unicode characters: a smiley, a warning sign, and a radioactive symbol. In fact I only get the first and the last; the warning sign is missing.

from curses import wrapper

def main(stdscr):
    # Clear screen
    stdscr.clear()

    for i in range(1, 11):
        stdscr.addstr(i, 0, '\u263a \u26a0 \u2622'.encode("utf-8"))

    stdscr.refresh()
    stdscr.getkey()

wrapper(main)

Additionally, Font Book shows that Menlo apparently does have a glyph for the warning sign. What puzzles me most is that if I go to Edit -> Special Characters, select the warning sign, and click Insert, I get a warning sign at the command prompt. Using print() also shows the warning sign.

What's going on?

EDIT: Apparently it's a bug in the OS X libc library. See the question "How to get ncurses to output astral plane unicode characters".

I compiled the small wcinfo program and ran it on two of the code points:

sbo@sbos-macbook:~$ ./wcinfo 26a0
Code 26A0: width -1 
sbo@sbos-macbook:~$ ./wcinfo 263a
Code 263A: width 1 punct graph print 

So, for the warning sign, we get a -1, which means non-printable character. So, definitely an OSX problem, and a fundamental one.
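For comparison, here is a minimal sketch that asks Python's own Unicode database about the same three code points. It relies only on the standard `unicodedata` module and does not consult the C library's locale tables, so it shows what Unicode itself says about the characters rather than what OS X reports. The helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(ch):
    """Summarise what Python's bundled Unicode database says about one character."""
    return {
        "code": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),  # e.g. 'So' = Symbol, other
        "printable": ch.isprintable(),
    }


# The smiley, the warning sign, and the radioactive sign from the question.
for ch in "\u263a\u26a0\u2622":
    print(describe(ch))
```

If all three report the same category and are printable here, the -1 from wcinfo points at the platform's character tables rather than at the character itself.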

Stefano Borini

2 Answers

When I run it on my Mac OS X 10.10 (Yosemite) terminal using Lucida Console as the font, I get the output shown below:

$ printf "%s\n" u+263a u+0020 u+26a0 u+0020 u+2622 | unicode-utf8
☺ ⚠ ☢
$  printf "%s\n" u+263a u+0020 u+26a0 u+0020 u+2622 | unicode-utf8 | odx
0x0000: E2 98 BA 20 E2 9A A0 20 E2 98 A2 0A               ... ... ....
0x000C:
$ printf "%s\n" u+263a u+0020 u+26a0 u+0020 u+2622 | unicode-utf8 | utf8-unicode
(standard input):
0xE2 0x98 0xBA = U+263A
0x20 = U+0020
0xE2 0x9A 0xA0 = U+26A0
0x20 = U+0020
0xE2 0x98 0xA2 = U+2622
0x0A = U+000A
$

The programs unicode-utf8, utf8-unicode, and odx are all home-brew (the Unicode ones are not particularly elegant), but they let me do analysis work with Unicode. At least on my computer, all three symbols show up. When they were not separated by spaces, the triangle and the radiation symbols overlapped on the screen (unlike in the browser), which is why I added the spaces:

☺⚠☢

So, I suggest looking hard at the output of the script you show. You might be seeing an encoding problem, or the curses library may not be properly aware of UTF-8, or …
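One quick thing to rule out is the locale, since curses relies on it to decide how to treat multibyte text. The following is only a sketch of such a check using the standard `locale` module; the function name `locale_report` is made up here, and the values it prints depend entirely on your environment.

```python
import locale


def locale_report():
    """Adopt the user's locale settings and report what Python sees."""
    try:
        # An empty string means "use the settings from the environment".
        current = locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale the system does not provide.
        current = None
    return {
        "locale": current,
        "preferred_encoding": locale.getpreferredencoding(),
    }


print(locale_report())
```

If the preferred encoding is not UTF-8, that alone could explain mangled output; if it is UTF-8, the locale is probably not the culprit.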

When I run with Python 2, I get:

\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622
\u263a \u26a0 \u2622

When I run with Python 3, I get:

☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢
☺   ☢

This means that I can reproduce the problem, but it seems to be a problem in Python rather than in the terminal.

I ran:

$ python3 so.26919799.py > py3.output
$ odx py3.output

The relevant part of the output is:

0x1D60: 20 20 20 20 20 20 20 1B 5B 36 35 3B 31 48 20 20          .[65;1H  
0x1D70: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                   
* (5)
0x1DD0: 20 20 20 20 20 20 20 20 20 20 20 08 20 08 1B 5B              . ..[
0x1DE0: 34 68 20 1B 5B 34 6C 1B 5B 48 0A E2 98 BA 20 20   4h .[4l.[H....  
0x1DF0: 20 E2 98 A2 0D 0A E2 98 BA 20 20 20 E2 98 A2 0D    ........   ....
0x1E00: 0A E2 98 BA 20 20 20 E2 98 A2 0D 0A E2 98 BA 20   ....   ........ 
0x1E10: 20 20 E2 98 A2 0D 0A E2 98 BA 20 20 20 E2 98 A2     ........   ...
0x1E20: 0D 0A E2 98 BA 20 20 20 E2 98 A2 0D 0A E2 98 BA   .....   ........
0x1E30: 20 20 20 E2 98 A2 0D 0A E2 98 BA 20 20 20 E2 98      ........   ..
0x1E40: A2 0D 0A E2 98 BA 20 20 20 E2 98 A2 0D 0A E2 98   ......   .......
0x1E50: BA 20 20 20 E2 98 A2 1B 5B 3F 31 6C 1B 3E 1B 5B   .   ....[?1l.>.[
0x1E60: 6D 0D 1B 5B 35 34 42 1B 5B 4B 1B 5B 36 35 3B 31   m..[54B.[K.[65;1
0x1E70: 48 1B 5B 32 4A 1B 5B 3F 34 37 6C 1B 38 0D 1B 5B   H.[2J.[?47l.8..[
0x1E80: 3F 31 6C 1B 3E                                    ?1l.>
0x1E85:

The 0x1D60: indicates a byte offset in the file. My terminal window is 110 wide and 65 deep, so there were a lot of blanks being generated by the output. The * (5) line indicates 5 more lines of 16 blanks. Then you can see some data containing bytes E2 98 BA and E2 98 A2, but in between there are three blanks where you'd expect a blank, E2 9A A0, and another blank. So, the translation of the alert symbol is being mishandled by Python 3.
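The same check can be done without reading the hex dump by eye. This is a rough sketch, not part of my original analysis: it searches a captured byte string for the UTF-8 encoding of each expected character. The function name `missing_chars` is invented for the example, and `sample` is one row copied from the dump above.

```python
def missing_chars(data, expected):
    """Return the characters whose UTF-8 encoding never appears in data."""
    return [ch for ch in expected if ch.encode("utf-8") not in data]


# One output row as it appears in the dump: smiley, three blanks, radioactive sign, CR LF.
sample = bytes.fromhex("E2 98 BA 20 20 20 E2 98 A2 0D 0A")

for ch in missing_chars(sample, "\u263a\u26a0\u2622"):
    print("missing: U+%04X" % ord(ch))
```

To apply it to a real capture, read the file in binary mode, for example `open("py3.output", "rb").read()`, and pass the result in place of `sample`.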

Jonathan Leffler
  • Interesting... very interesting. I'll post it to python-dev and hear what they think about it. – Stefano Borini Nov 16 '14 at 10:14
  • @eryksun: Using Python 3.4.0 (rather than 3.4.2, the latest version of Python, which I've not yet created on my Mac), and spell-fixing `getpreferredencoding()`, adding the locale code you suggest doesn't make a difference to the display. – Jonathan Leffler Nov 16 '14 at 17:21
  • You get the same error in Python 2? If so it's probably a curses library bug that has nothing to do with Python. It works for me in Linux w/ ncurses 5.9.20140118. – Eryk Sun Nov 16 '14 at 17:44
  • Questions about Python bugs should go to python-list (accessible via news.gmane.net without subscribing), rather than python-dev. – Terry Jan Reedy Nov 16 '14 at 19:34
  • @eryksun: I used `stdscr.addstr(i, 0, u'\u263a \u26a0 \u2622'.encode("utf-8"))` with Python 2.7.6 and didn't get the middle triangular warning sign, U+26A0. I'm messing with some C code to see what I can get. The terminal is capable of displaying the character, so the problem is in some aspect of the software displaying the data. – Jonathan Leffler Nov 16 '14 at 19:35
  • Did you test ncurses in C? – Eryk Sun Nov 16 '14 at 21:51
  • Try a simple `ncursesw` test program: `int main() {` `setlocale(LC_ALL, "");` `initscr();` `mvaddstr(1, 0, "\xe2\x98\xba \xe2\x9a\xa0 \xe2\x98\xa2");` `refresh();` `getch();` `endwin();` `return 0;}`. If the OP is right, the warning sign will be skipped in OS X as a non-printable character. – Eryk Sun Nov 16 '14 at 22:56

The warning sign prints fine in Idle's tkinter text widget with Lucida Console on Python 3.4.2, Windows 7. Moreover, Python correctly UTF-8 encodes and decodes the character. This is contrary to "python fails to correctly encode \u26a0 (warning sign) to utf-8", which Stefano posted to python-dev.

>>> s='\u26a0'
>>> s
'⚠'  # up-pointing triangle /_\ with ! inside
>>> b=s.encode('utf-8')
>>> b
b'\xe2\x9a\xa0'  # E2 9A A0 is what Jonathan said is correct.
>>> b.decode('utf-8')
'⚠'

Is stdscr an extra builtin name on OS X? Or is there code missing that defines it?

Terry Jan Reedy
  • Any comment about 'Python fails to encode it correctly' must be understood to be qualified with 'for the particular version of Python built for use on Mac OS X' (though I don't know whether it was so qualified when posted on python-dev). I'm using the system-provided Python 2.7.6 or 3.4.0 builds on OS X 10.10 Yosemite. I'm sure it is not a general Python bug — but it does seem to be reproducible on two separate Macs (Stefano's and mine). The `stdscr` is a name from the `curses` module (or `curses` `wrapper` module), at least on Mac. – Jonathan Leffler Nov 16 '14 at 20:43
  • Does `'u2620'.encode('utf-8' == b''\xe2\x9a\xa0'` fail on Mac? If not, and I strongly suspect not, then Python *is* encoding the char correctly. As near as I can tell, Stefano and you only have a problem when using stdscr/curses, which wraps the system curses. So it appears that stdscr.addstr gets properly encoded bytes -- easy to test by separating encoding and call. Both I and the other developer who responded to Stefano's post suspect that the problem is in the system curses, not the Python wrapper. It is unlikely that the wrapper selectively filters out the warning sign. – Terry Jan Reedy Nov 16 '14 at 21:24
  • I tried this code (spread over four lines when run): `if 'u2620'.encode('utf-8') == b'\xe2\x9a\xa0': print("Equal") else: print("Unequal")` with Python 3; it printed Unequal. (I had the lines `import locale`, `locale.setlocale(locale.LC_ALL, '')`, and `encoding = locale.getpreferredencoding()` in the file too.) I used the `print "String"` notation with Python 2; it too printed Unequal. – Jonathan Leffler Nov 16 '14 at 21:38
  • @JonathanLeffler, use `u'\u26a0'.encode('utf-8') == b'\xe2\x9a\xa0'`. That should work in 2.6, 2.7 and 3.3+. – Eryk Sun Nov 16 '14 at 21:40
  • @eryksun: your suggested notation worked correctly with both 2.7.6 and 3.4.0, printing "Equal" in both cases (with the appropriate `print` notations, of course). Also, FWIW, running `print(u'\u263a \u26a0 \u2622'.encode("utf-8"))` in Python 3 and the corresponding code in Python 2 (no curses around) prints "☺ ⚠ ☢" so the problem is probably in the Python interface to the curses library or in the curses library itself, rather than in raw Python. – Jonathan Leffler Nov 16 '14 at 21:43
  • @JonathanLeffler My apology for leaving off the '\'. To decide where the problem is, someone could access the OSX curses library with equivalent code in a different language. Or someone could try reading the C-coded curses wrapper. – Terry Jan Reedy Nov 16 '14 at 22:30
  • @JonathanLeffler: Updated question. Seems to be a libc problem. – Stefano Borini Nov 16 '14 at 22:42