I have the following code:

# -*- coding: utf-8 -*-
print "╔╤╤╦╤╤╦╤╤╗"
print "╠╪╪╬╪╪╬╪╪╣"
print "╟┼┼╫┼┼╫┼┼╢"
print "╚╧╧╩╧╧╩╧╧╝"
print "║"
print "│"

For some reason, only the third line (╚╧╧╩╧╧╩╧╧╝) prints correctly; the rest comes out as an odd combination of symbols. I assume this is an encoding issue. The full output in IDLE is as follows:

╔╤╤╦╤╤╦╤╤╗
╠╪╪╬╪╪╬╪╪╣
╟┼┼╫┼┼╫┼┼╢
╚╧╧╩╧╧╩╧╧╝
â•‘
│

What is causing this and how can I fix this? I'm using a tablet (Surface Pro 3 with Win10) with only a touch keyboard, so any solution with the least amount of typing (especially typing out weird characters) would be ideal, but obviously all help is appreciated.

CharlieDeBeadle
  • It must be a local issue, because it works fine on *nix systems here, local and remote. http://ideone.com/DeanM5 – l'L'l Aug 11 '15 at 15:24
  • I might be crazy, but don't you need to prefix unicode strings in Python 2? E.g., `print u"╠╪╪╬╪╪╬╪╪╣"`. See https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code – Fred Larson Aug 11 '15 at 15:31
  • @FredLarson Without the "u", the strings are just UTF-8 byte streams that don't need further encoding before being passed to the terminal. – chepner Aug 11 '15 at 15:37
  • @FredLarson Thanks, that fixed it! I wonder why line 3 worked fine... – CharlieDeBeadle Aug 11 '15 at 15:47
  • @chepner: Yes, but do UTF-8 byte streams work on Windows? I believe Windows uses UTF-16 natively. – Fred Larson Aug 11 '15 at 15:54
  • Depends on the terminal. UTF-16 is what Windows uses internally to store Unicode code points as bytes. – chepner Aug 11 '15 at 15:56
  • @chepner: IDLE is a *GUI* program. What encoding the terminal uses (such as cp866) is irrelevant in this case. Unicode strings are not utf-8 byte streams. They are immutable sequences of Unicode codepoints in Python. Python uses bytes-based interfaces to talk to a terminal and therefore you have to encode Unicode to bytes. Again, it is unrelated to IDLE. The exception is Windows console where (in principle) you could use Unicode API, [see `win-unicode-console` package.](https://github.com/Drekin/win-unicode-console/) – jfs Aug 11 '15 at 18:39

2 Answers

Mojibake indicates that text encoded in one encoding is being displayed using another, incompatible encoding:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"╔╤╤╦╤╤╦╤╤╗".encode('utf-8').decode('cp1252')) #XXX: DON'T DO IT
# -> â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—

There are several places where the wrong encoding could be used.

The # coding: utf-8 encoding declaration says how non-ascii characters in your source code (e.g., inside string literals) should be interpreted. If print u"╔╤╤╦╤╤╦╤╤╗" works in your case, then the source code itself is being decoded to Unicode correctly. For debugging, you could write the string using only ascii characters: u'\u2554\u2557' == u'╔╗'.
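
As a minimal sketch of that debugging tip, the snippet below turns a Unicode string into its ascii-only escaped form, which can then be pasted into source code without typing any box-drawing characters. The helper name to_ascii_escapes is hypothetical and not part of any library; the script is written to run on both Python 2 and Python 3.

```python
from __future__ import print_function, unicode_literals


def to_ascii_escapes(text):
    # Hypothetical helper: spell every non-ascii character as a backslash escape
    # so the result is pure ascii and safe to print on any console.
    return text.encode('unicode_escape').decode('ascii')


corners = '\u2554\u2557'  # the two top corners of the box from the question
print(to_ascii_escapes(corners))
```

Because the printed text is pure ascii, this also works in a console that cannot display the characters themselves.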

print "╔╤╤╦╤╤╦╤╤╗" (DON'T DO IT) prints bytes (text encoded using utf-8 in this case) as is. IDLE itself works with Unicode (BMP), so the bytes must be decoded into Unicode text before they can be shown in IDLE. It seems that, on Windows, IDLE uses an ANSI code page such as cp1252 (locale.getpreferredencoding(False)) to decode the output bytes. Don't print text as bytes: it fails in any environment that uses a character encoding different from that of your source code. For example, you would get ΓòöΓòù... mojibake if you ran the code from the question in a Windows console that uses the cp437 OEM code page.
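
The effect described above can be reproduced without IDLE or a Windows console. The sketch below encodes text as utf-8 and decodes it with a different code page; misdecode and escaped are hypothetical helper names. It also shows that Python's own cp1252 codec has no mapping for byte 0x9D, which is the last utf-8 byte of ╝ (U+255D). That might be related to the one line in the question that behaved differently, but how IDLE handles such a failure is an assumption that this thread does not confirm.

```python
from __future__ import print_function, unicode_literals


def misdecode(text, wrong_encoding):
    # utf-8 bytes (what the source file holds) read back with another code page.
    return text.encode('utf-8').decode(wrong_encoding)


def escaped(text):
    # ascii-only view of a string, so printing never depends on the console.
    return text.encode('unicode_escape').decode('ascii')


corners = '\u2554\u2557'
print(escaped(misdecode(corners, 'cp1252')))  # mojibake as an ANSI code page sees it
print(escaped(misdecode(corners, 'cp437')))   # mojibake as an OEM code page sees it

# U+255D ends in byte 0x9D, which Python's cp1252 codec leaves undefined.
try:
    misdecode('\u255d', 'cp1252')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)
```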

You should use Unicode for all text in your program. Python 3 even forbids non-ascii characters inside a bytes literal. You would get SyntaxError there.
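
The Python 3 behaviour can be checked without writing a file by compiling a one-line source string. This is only a sketch: bytes_literal_is_rejected is a hypothetical helper, and the result depends on the interpreter version (Python 2 accepts the literal, Python 3 rejects it).

```python
import sys


def bytes_literal_is_rejected(source):
    # Hypothetical helper: True if the interpreter refuses to compile the source.
    try:
        compile(source, '<demo>', 'eval')
    except SyntaxError:
        return True
    return False


literal = u'b"\u2554"'  # source text of a bytes literal containing a non-ascii character
rejected = bytes_literal_is_rejected(literal)
print(sys.version_info[0], rejected)
```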

print(u'\u2554\u2557') might fail with UnicodeEncodeError if you run the code in a Windows console whose OEM code page (such as cp437) can't represent the characters. To print arbitrary Unicode characters in a Windows console, use the win-unicode-console package. You don't need it if you use IDLE.
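
If installing a package is not an option, one possible workaround is to test whether the output encoding can represent the text and to substitute a placeholder when it cannot. The sketch below assumes that sys.stdout.encoding reflects the console's code page; can_encode and printable are hypothetical helper names.

```python
from __future__ import print_function, unicode_literals
import sys


def can_encode(text, encoding):
    # True if every character of text exists in the given encoding.
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def printable(text, encoding):
    # Replace characters the encoding lacks with '?' instead of raising.
    return text.encode(encoding, 'replace').decode(encoding)


box = '\u2554\u2557'
# Assumption: fall back to ascii when the stream does not report an encoding.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(printable(box, console_encoding))
```

Note that cp437 does contain the box-drawing characters used in the question, so on such a console the text should print unchanged; the replacement only happens for encodings that lack them, such as cp1252 or ascii.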

jfs

Putting a u before the strings fixed the issue, as per @FredLarson's suggestion:

print u"╔╤╤╦╤╤╦╤╤╗"
print u"╠╪╪╬╪╪╬╪╪╣"
print u"╟┼┼╫┼┼╫┼┼╢"
print u"╚╧╧╩╧╧╩╧╧╝"
print u"║"
print u"│"

The exact cause still isn't known, since it seemed to work on other systems and it's odd that the third line worked fine.

CharlieDeBeadle
  • There are three steps involved here: 1) The file itself contains the UTF-8 encoding of the characters. They appear as box drawing characters because your editor understands UTF-8 and decodes it for display. 2) The `u"..."` tells Python to create Unicode objects, meaning the UTF-8 is decoded to the appropriate code points, which are (internally) re-encoded for storage in memory. 3) On output, the Unicode strings are automatically re-encoded as UTF-8 for display on the terminal. – chepner Aug 11 '15 at 16:01
  • @chepner: it is not correct. (1) the file (by mistake) may contain characters encoded using some other character encoding e.g., if you type it in Windows console then `cp437` (OEM cp) may be used. (2) Python uses the encoding specified in `coding:` declaration to decode `u"..."` literals. It is utf-8 in this case but it can be some other encoding. (3) What encoding is used on output depends on how/where the script is run. The question indicates that [IDLE uses cp1252 to decode the output bytes in this case](http://stackoverflow.com/a/31949236/4279). It might skip reencoding for Unicode. – jfs Aug 11 '15 at 18:29