
I'm working in the Windows console and I cannot print the superscript digits. This is what I get:

>>> '¹²³⁴⁵⁶⁷⁸⁹'
'1²345678?'

>>> for i in '¹²³⁴⁵⁶⁷⁸⁹': print(i, i.encode())
...
1 b'1'          # expect  b'\x00\xb9' (U+00B9)
² b'\xc2\xb2'   # expect  b'\x00\xb2' (U+00B2)
3 b'3'          # expect  b'\x00\xb3' (U+00B3)
4 b'4'          # expect  b'\x20\x74' (U+2074)
5 b'5'          # expect  b'\x20\x75' (U+2075)
6 b'6'          # expect  b'\x20\x76' (U+2076)
7 b'7'          # expect  b'\x20\x77' (U+2077)
8 b'8'          # expect  b'\x20\x78' (U+2078)
? b'?'          # expect  b'\x20\x79' (U+2079)

I tried setting the environment variable PYTHONIOENCODING this way:

set PYTHONIOENCODING=utf-8

but this is what I get:

>>> '¹²³⁴⁵⁶⁷⁸⁹'
   File "<stdin>", line 0

     ^
SyntaxError: 'utf-8' codec can not decode bytes 0xfd in position 2: invalid start byte

The problem in this case is the '²'. If I replace it with a plain 2, I get:

>>> '¹2³⁴⁵⁶⁷⁸⁹'
'12345678?'

How can I fix this? Thanks!

Arctic Pi
  • I think that there are some things that the Windows console simply cannot do. You can google `chcp` and take it from there – Ma0 Mar 02 '17 at 10:05
  • The console's support for codepage 65001 (UTF-8) is buggy (e.g. it doesn't support non-ASCII input, even in WSL Linux subsystem in Windows 10), so using UTF-8 is not the answer. The solution is to use the wide-character functions `ReadConsoleW` and `WriteConsoleW` to read and write UTF-16 to the console. Python 3.6 has a new Windows console I/O implementation that does this, and for older versions you can install [`win_unicode_console`](https://pypi.python.org/pypi/win_unicode_console). – Eryk Sun Mar 02 '17 at 16:05
  • @Ev.Kounis, the Windows console (conhost.exe) can potentially display all characters in the Unicode basic multilingual plane (BMP). Surrogate pairs are at least preserved. The console's default font support is limited since it doesn't seem to use Uniscribe. However, you can manually define fallback links in the registry key `HKLM\Software\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink`. For example, create a new multi-string value named "Consolas" if you use this font. Copy links from the existing values such as `MINGLIU.TTC,PMingLiU` and `SIMSUN.TTC,SimSun`. – Eryk Sun Mar 02 '17 at 17:06
  • Possible duplicate of [Python, Unicode, and the Windows console](http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console) – roeland Mar 03 '17 at 02:14
  • Use Python 3.6. Works fine. Python 3.6 uses the Windows Unicode APIs and fixes a lot of problems with Unicode in the command prompt. – Mark Tolonen Mar 03 '17 at 03:53
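The symptoms in the question can be reproduced without a console. The sketch below assumes the console's active code page was 437 (the question does not say; `CONSOLE_CODEPAGE` and `console_bytes` are illustrative names). In that code page '²' is the single byte 0xFD, which is not a valid UTF-8 start byte, so forcing Python to decode console input as UTF-8 fails exactly where the traceback says. Note that Python's `errors='replace'` substitutes '?' for every unmappable character, whereas the console applies a best-fit mapping (¹ becomes 1, and so on), so only the position of the 0xFD byte is meant to match.

```python
# Assumption: the console used code page 437; the question does not state it.
CONSOLE_CODEPAGE = 'cp437'


def console_bytes(text, codepage=CONSOLE_CODEPAGE):
    """Encode text roughly as a legacy console would hand it to Python.

    Unmappable characters become '?' here; the real console uses a
    best-fit mapping instead (for example '¹' -> '1').
    """
    return text.encode(codepage, errors='replace')


raw = console_bytes("'¹²³'")
print(raw)  # the byte for '²' is 0xFD in cp437

try:
    raw.decode('utf-8')  # what PYTHONIOENCODING=utf-8 asks Python to do
except UnicodeDecodeError as exc:
    print('byte 0x%02x at position %d: %s' % (raw[exc.start], exc.start, exc.reason))
```

This is why setting PYTHONIOENCODING alone does not help: it changes how Python decodes the bytes, not which bytes the console sends.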

1 Answer


Eryksun's comment is right: the console's support for codepage 65001 (UTF-8) is buggy. However, there is a workaround: put the code in a .py script (saved as UTF-8) instead of typing it at the interactive prompt:

import unicodedata
x=u'¹²³⁴⁵⁶⁷⁸⁹'
for i in x:
    print( i, 
        unicodedata.normalize('NFKC', i),
        i.encode(),                        # the same as i.encode('utf-8')
        hex(ord(i)),
        ''
        )

Output, with the above script run as follows:

D:\bat\SO> set python
PYTHONIOENCODING=UTF-8

D:\bat\SO> chcp
Active code page: 65001

D:\bat\SO> D:\test\Python\Py3\42552164.py

¹ 1 b'\xc2\xb9' 0xb9
² 2 b'\xc2\xb2' 0xb2
³ 3 b'\xc2\xb3' 0xb3
⁴ 4 b'\xe2\x81\xb4' 0x2074
⁵ 5 b'\xe2\x81\xb5' 0x2075
⁶ 6 b'\xe2\x81\xb6' 0x2076
⁷ 7 b'\xe2\x81\xb7' 0x2077
⁸ 8 b'\xe2\x81\xb8' 0x2078
⁹ 9 b'\xe2\x81\xb9' 0x2079

D:\bat\SO>
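As a side note on the byte values: the "expect" comments in the question (for example `b'\x00\xb9'`) are the raw code points written as two bytes, which is what UTF-16 big-endian produces, while `str.encode()` defaults to UTF-8. The following sketch (the helper name `describe` is only illustrative) shows both encodings side by side, together with the NFKC form used above. It prints with `ascii()` so that it runs on any console regardless of code page.

```python
import unicodedata

SUPERSCRIPTS = '¹²³⁴⁵⁶⁷⁸⁹'


def describe(ch):
    """Return (code point, NFKC form, UTF-8 bytes, UTF-16-BE bytes) for one character."""
    return (
        'U+%04X' % ord(ch),
        unicodedata.normalize('NFKC', ch),  # compatibility mapping, e.g. '¹' -> '1'
        ch.encode('utf-8'),                 # what i.encode() returns by default
        ch.encode('utf-16-be'),             # the two-byte form the question expected
    )


for ch in SUPERSCRIPTS:
    # ascii() escapes non-ASCII characters, so printing cannot fail on a legacy console
    print(ascii(ch), *describe(ch))
```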

Environment:

  • Windows 8.1,
  • Python 3.5,
  • cmd window font Consolas or DejaVu Sans Mono.

Resources: The Python Standard Library.

Update in view of Eryksun's further comments: I don't think the script workaround is perfect. For instance, adding print(x) to the above script produces output followed by trailing garbage like this:

¹²³⁴⁵⁶⁷⁸⁹
�⁶⁷⁸⁹
⁸⁹
��

Nonetheless, it's surely better than having the Python console exit as soon as it receives any non-ASCII input:

D:\bat\SO> py -3
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'¹²³⁴⁵⁶⁷⁸⁹'


D:\bat\SO>
JosefZ
  • Try using codepage 65001 in Windows 7. The output will have trailing garbage because `WriteFile` mistakenly returns the number of decoded UTF-16 code points instead of the number of bytes written. The error is in the console itself (that's conhost.exe; not the cmd.exe shell). This bug was fixed in Windows 8 when Microsoft completely rewrote the IPC communication with the console to use real kernel handles for the ConDrv device driver. – Eryk Sun Mar 02 '17 at 23:41
  • Even in the latest Windows 10, you can't actually read non-ASCII (as in anything but codes 1-127) using codepage 65001. It's a more fundamental bug in conhost.exe that will require a rewrite to no longer assume that the input codepage is SBCS or DBCS. UTF-8 is variable sized from 1-4 bytes, which requires smarter buffer allocation when the console calls `WideCharToMultiByte` to encode its UTF-16 input buffer as UTF-8. The encode fails if the input buffer has even 1 non-ASCII character, which causes `ReadFile` to return that 0 bytes were successfully read, which means end of file (EOF). – Eryk Sun Mar 02 '17 at 23:51
  • What you're seeing in 3.5 when trying to input `'¹²³⁴⁵⁶⁷⁸⁹'` isn't really a crash. The REPL in 3.5 uses the standard I/O implementation to read from the console. This calls the CRT's `read` function, which in binary mode calls `ReadFile`. Because you're using the console's buggy implementation of codepage 65001, this call 'successfully' reads 0 bytes, and the REPL legitimately interprets this as EOF and quits normally. It doesn't crash. In 3.6+, this all works correctly using the Windows wide-character API. The solution for versions prior to 3.6 is to install and enable `win_unicode_console`. – Eryk Sun Mar 03 '17 at 00:28
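Where neither upgrading to 3.6+ nor installing `win_unicode_console` is an option, one fallback is to escape whatever the output stream cannot encode, so that printing at least never raises. This is only a sketch, not something proposed in the thread: `printable` is a hypothetical helper, and it does nothing about the input-side problem described in the comments above.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes (for example '\\u2074').

    Defaults to the encoding of sys.stdout, falling back to ASCII.
    """
    enc = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(enc, errors='backslashreplace').decode(enc)


# Characters the console code page supports pass through unchanged;
# the rest appear as escapes instead of raising UnicodeEncodeError.
print(printable('¹²³⁴⁵⁶⁷⁸⁹'))
```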