19

The Windows console has been Unicode-aware for at least a decade, and perhaps as far back as Windows NT. However, for some reason the major cross-platform scripting languages, including Perl and Python, only ever output various 8-bit encodings, which takes a lot of trouble to work around. Perl gives a "wide character in print" warning; Python gives a charmap error and quits. Why on earth, after all these years, do they not just call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?

Is it just that cross-platform performance is a low priority? Is it that the languages use UTF-8 internally and find it too much bother to output UTF-16? Or are the -W APIs inherently broken to such a degree that they can't be used as-is?

UPDATE

It seems that the blame may need to be shared by all parties. I imagined that the scripting languages could just call wprintf on Windows and let the OS/runtime worry about things such as redirection. But it turns out that even wprintf on Windows converts wide characters to ANSI and back before printing to the console!

Please let me know if this has been fixed: the bug report link seems broken, but my Visual C test code still fails for wprintf and succeeds for WriteConsoleW.
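For reference, here is a minimal sketch of the kind of test I mean (illustrative, not my exact test code; the sample string is arbitrary):

#include <windows.h>
#include <wchar.h>

int main(void) {
    /* Succeeds: bypasses the C runtime and hands UTF-16 straight to
       the console. Only works when stdout really is a console. */
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    const wchar_t *text = L"caf\x00E9\n";  /* "café" */
    DWORD written;
    WriteConsoleW(out, text, (DWORD)wcslen(text), &written, NULL);

    /* Garbled by default: the CRT converts the wide characters to
       ANSI on the way out, as described above. */
    wprintf(L"caf\x00E9\n");
    return 0;
}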

UPDATE 2

Actually you can print UTF-16 to the console from C using wprintf, but only if you first do _setmode(_fileno(stdout), _O_U16TEXT).
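A minimal sketch, assuming a Microsoft C runtime new enough to have _O_U16TEXT:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    /* _setmode, _fileno and _O_U16TEXT are Microsoft extensions,
       not standard C. This puts stdout into UTF-16 text mode. */
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"caf\x00E9 \x4F60\x597D\n");  /* "café 你好" */
    /* After this, narrow output (e.g. printf) trips an assertion in
       the debug CRT, so don't mix wide and narrow output. */
    return 0;
}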

From C you can print UTF-8 to a console whose codepage is set to 65001; however, Perl, Python, PHP and Ruby all have bugs which prevent this. Perl and PHP corrupt the output by adding additional blank lines following lines which contain at least one wide character. Ruby produces slightly different corrupt output. Python crashes.
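For the record, a sketch of the C case that works (setting the codepage programmatically instead of running chcp 65001 first; the sample string is arbitrary):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Equivalent to running "chcp 65001" before the program. */
    SetConsoleOutputCP(CP_UTF8);
    /* UTF-8 bytes for "café 你好". */
    printf("caf\xC3\xA9 \xE4\xBD\xA0\xE5\xA5\xBD\n");
    return 0;
}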

UPDATE 3

Node.js is the first scripting language to ship without this problem, straight out of the box.

The Python dev team slowly came to realize this was a real problem after it was first reported back at the end of 2007, and 2016 saw a huge flurry of activity to fully understand and fully fix the bug.

hippietrail
  • 15,848
  • 18
  • 99
  • 158
  • 5
    You can't "output Unicode". Unicode is a method of representing characters internally as code points. To output it, you need some form of encoding - probably UTF-8. – Daniel Roseman Feb 09 '11 at 10:49
  • 2
    Of course you can output Unicode. On *nix the standard encoding to output Unicode in is UTF-8. On Windows the standard way to output is UTF-16, except that in the Windows world they say "Unicode" when they mean UTF-16. This probably goes for Java too and anywhere else where UTF-8 is not primary. – hippietrail Feb 09 '11 at 12:54
  • 5
    @Daniel: if you don't like the terminology, then replace it with "print arbitrary Unicode characters on the console if appropriate conditions (font support etc.) are met". UTF-8 is part of the Unicode standard, which does much more than just assign code points. – Philipp Feb 09 '11 at 13:10
  • @Daniel: Unicode has specific terminology in which "encoding" means exactly "method of representing characters as code points". Compare this with UTF which stands for "Unicode transformation format" which is the process of representing codepoints as a stream of bytes or words etc. Outside the Unicode world, the mapping of characters to numbers (codepoints) and the transformation of a series of codepoints into a string of bytes or words are blurred together as "encoding". Confusing and annoying perhaps but that's how it is. – hippietrail Feb 11 '11 at 09:56
  • node.js is the first scripting language I have found that works out of the box with Unicode in the console on both *nix and Windows systems! Of course it's not intended as a regular scripting language but for server-side node-based stuff, so many features you'd expect from a scripting language are missing. ([It's not easy to read text line-by-line for instance.](http://stackoverflow.com/questions/6156501)) – hippietrail Jan 02 '13 at 02:34

9 Answers

20

The main problem seems to be that it is not possible to use Unicode on Windows using only the standard C library and no platform-dependent or third-party extensions. The languages you mentioned originate from Unix platforms, whose method of implementing Unicode blends well with C (they use normal char* strings, the C locale functions, and UTF-8). If you want to do Unicode in C, you more or less have to write everything twice: once using nonstandard Microsoft extensions, and once using the standard C API functions for all other operating systems. While this can be done, it usually doesn't have high priority because it's cumbersome and most scripting language developers either hate or ignore Windows anyway.
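A sketch of that "write everything twice" pattern (purely illustrative; the Windows branch uses the nonstandard Microsoft extensions mentioned above):

#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

int main(void) {
#ifdef _WIN32
    /* Windows branch: nonstandard Microsoft extensions, UTF-16. */
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"caf\x00E9\n");
#else
    /* Everywhere else: plain char* strings holding UTF-8. */
    printf("caf\xC3\xA9\n");
#endif
    return 0;
}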

At a more technical level, I think the basic assumption that most standard library designers make is that all I/O streams are inherently byte-based at the OS level, which is true for files on all operating systems, and for all streams on Unix-like systems, with the Windows console being the only exception. Thus the architecture of many class libraries and programming language standards would have to be modified to a great extent to incorporate Windows console I/O.

Another, more subjective point is that Microsoft just did not do enough to promote the use of Unicode. The first Windows OS with decent (for its time) Unicode support was Windows NT 3.1, released in 1993, long before Linux and OS X grew Unicode support. Still, the transition to Unicode in those OSes has been much more seamless and unproblematic. Microsoft once again listened to the sales people instead of the engineers, and kept the technically obsolete Windows 9x around until 2001; instead of forcing developers to use a clean Unicode interface, they still ship the broken and now-unnecessary 8-bit API interface, and invite programmers to use it (look at a few of the recent Windows API questions on Stack Overflow; most newbies still use the horrible legacy API!).

When Unicode came out, many people realized it was useful. Unicode started as a pure 16-bit encoding, so it was natural to use 16-bit code units. Microsoft then apparently said "OK, we have this 16-bit encoding, so we have to create a 16-bit API", not realizing that nobody would use it. The Unix luminaries, however, thought "how can we integrate this into the current system in an efficient and backward-compatible way so that people will actually use it?" and subsequently invented UTF-8, which is a brilliant piece of engineering. Just as when Unix itself was created, the Unix people thought a bit more, took a bit longer, had less financial success, but eventually got it right.

I cannot comment on Perl (but I think that there are more Windows haters in the Perl community than in the Python community); regarding Python, I know that the BDFL (who doesn't like Windows either) has stated that adequate Unicode support on all platforms is a major goal.

Philipp
  • 48,066
  • 12
  • 84
  • 109
  • 2
    +1 Very informative take on a question that regularly frustrates me. – David Heffernan Feb 09 '11 at 16:18
  • I've accepted this as the answer as it's the only one that seriously tries to answer my question literally, even though I still don't have a way to output Unicode to the Windows console in either Perl or Python! But I have some further comments: – hippietrail Feb 11 '11 at 09:58
  • 1
    Are the wprintf() and related functions part of the standard C library or purely MS extensions? Is iconv() part of the standard C library? Do either Perl or Python declare somewhere that they adhere strictly to the standard C library and avoid things which may be extensions such as wprintf() and iconv()? By the way I have done Unicode before in C/C++ for the AbiWord cross-platform word processor in which I implemented the encoded text save and load functionality. But these days I prefer scripting languages since I mostly do multilanguage text processing. – hippietrail Feb 11 '11 at 10:04
  • 2
    @hippietrail: `wprintf` is standard C, but `_setmode` and `_fileno` are not. Often (but not always) Microsoft prepends non-standard extensions with an underscore. `iconv` is not part of the C standard. Neither Perl nor Python use pure C without extensions, because even some very common things such as reading directory contents or creating links aren't included in the C standard. Lua uses only standard C functions in its standard library, but even then it has to use extensions for dynamic module loading. – Philipp Feb 11 '11 at 14:02
  • 2
    If Perl and Python don't use the Microsoft extensions for Unicode output, you have to do it yourself. Console output in Windows always needs to go through `WriteConsoleW`; there is just no other way. See e.g. [this long discussion](http://bugs.python.org/issue1602) (where many contributors incorrectly think that Unicode doesn't work in the Windows console or that it has anything to do with codepages). It contains a link to [a possible fix](http://tahoe-lafs.org/trac/tahoe-lafs/browser/src/allmydata/windows/fixups.py), but in general the Python standard library has to be rewritten. – Philipp Feb 11 '11 at 14:14
9

A small contribution to the discussion: I am running Czech-localized Windows XP, which uses the CP1250 code page almost everywhere. The funny thing about the console, though, is that it still uses the legacy DOS code page 852.

I was able to make a very simple Perl script that prints UTF-8 encoded data to the console using:

binmode STDOUT, ":utf8:encoding(cp852)";

I tried various options (including utf16le), but only the setting above printed accented Czech characters correctly.

Edit: I played with the problem a little more and found Win32::Unicode. The module exports a printW function that works properly both when printing to the console and when redirected:

use utf8;
use Win32::Unicode;

binmode STDOUT, ":utf8";
printW "Příliš žluťoučký kůň úpěl ďábelské ódy";
bvr
  • 9,687
  • 22
  • 28
  • Same with Cyrillic. All the 8-bit APIs use CP1251, the "ANSI encoding" as they call it; and the console API uses CP866, an old code page from DOS times, which they call the "OEM encoding". – ulidtko Feb 09 '11 at 11:07
  • Actually the Windows console supports many encodings, not all equally well. You can call the W functions to output arbitrary Unicode text, in which case it doesn't matter what the native language or locale of your system is (yes, it has to be in UCS-2). You can call the A functions to output "ANSI" text, which can be any supported 8-bit or multibyte (Chinese, Japanese, Korean) codepage. The default codepage will be an old IBM 8-bit one if DOS supported your language, but not if it's a language that's only been supported recently (such as Hindi). You can override this with the CHCP command. – hippietrail Feb 11 '11 at 10:18
  • 2
    The old IBM code pages (such as 852) are used for compatibility because they include graphics characters which were used in many old DOS apps - and many of these are still being used! The newer code pages (such as 1250) were introduced for Windows and don't include the legacy graphics characters needed for console apps. – hippietrail Feb 11 '11 at 10:20
  • @hippietrail I realize that there is a rationale for having backwards compatibility. Also, thanks for bringing up `chcp`; I did not know about it. Is there any way to enable `utf-8` using it? It is easy to make Perl output UTF-8, but it seems to be difficult to make the console display it well. – bvr Feb 11 '11 at 16:04
  • 1
    @bvr: "chcp 65001" enables UTF-8 but it doesn't seem to be well supported. It causes oddly broken output from Perl and it causes Python to crash! – hippietrail Feb 11 '11 at 16:13
  • @hippietrail Interesting - I am getting the beginning of the string correctly with `chcp 65001` and Perl outputting UTF-8, but it seems to be improperly ended (part of the previous string is repeated after the line end) – bvr Feb 11 '11 at 16:32
  • 2
    @bvr: Yes I get the same thing. I'm not sure if it's 100% Windows's fault or some interaction between Windows and Perl though I've assumed it's the former. I'm pretty certain it's due to string functions assuming the number of bytes would equal the number of characters. – hippietrail Feb 11 '11 at 16:44
  • 1
    @hippietrail I found the method that works correctly - using Win32::Unicode module. Added into my answer an example. – bvr Feb 12 '11 at 19:43
  • @bvr: Thanks for finding Win32::Unicode::Console::printW() - it's just what I needed for Perl - now I only have to find a solution for Python. – hippietrail Feb 13 '11 at 04:46
  • @hippietrail There is even `Win32::Unicode::Native` that replaces `print` with unicode version, so it works transparently. Overall thanks for bringing up this issue, I already used that in one script of mine. – bvr Feb 14 '11 at 20:06
  • For those who like to use colors in their output, unfortunately there seems to be some incompatibilities between modules `Win32::Console::ANSI` and `Win32::Unicode`. Because when you `use` both, colors won't be shown and ANSI codes will be shown instead. – Stamm Jan 27 '12 at 16:42
7

I have to unask many of your questions.

Did you know that

  • Windows uses UTF-16 for its APIs, but still defaults to the various "fun" legacy encodings (e.g. Windows-1252, Windows-1251) in userspace, including file names, differently for the many localisations of Windows?
  • you need to encode output, and picking the appropriate encoding for the system is achieved by the locale pragma, and that there is a POSIX standard called locale on which this is built, and that Windows is incompatible with it?
  • Perl already supported the so-called "wide" APIs once?
  • Microsoft managed to adapt UTF-8 into their codepage system of character encoding, and you can switch your terminal by issuing the appropriate chcp 65001 command?
daxim
  • 39,270
  • 4
  • 65
  • 132
  • 1
    The legacy API functions are still available, but they do nothing else than converting strings to and from UTF-16 and calling the UTF-16 functions. Any sane Windows application uses the UTF-16 functions directly nowadays. – Philipp Feb 09 '11 at 10:41
  • 1
    I do know that Windows uses UTF-16 for its APIs but you're wrong about the legacy encodings. They are not the default at all but only to support legacy stuff. Everything is UTF-16 internally including filenames except on legacy filesystems. – hippietrail Feb 09 '11 at 12:57
  • @hippietrail: My comment was meant to be an addition to the phrase “but still defaults to the various "fun" legacy encodings (e.g. Windows-1252, Windows-1251) in userspace”, which I think is not entirely correct because the legacy functions are not more default than the UTF-16 ones. – Philipp Feb 09 '11 at 13:08
  • 2
    I don't know how so much misinformation can result in 6 upvotes! – David Heffernan Feb 09 '11 at 16:15
  • 2
    * Did you know that Windows is officially POSIX compliant? * Did you know that codepage 65001 is still totally broken in the console in Windows 7? Perl kinda works with it but there seems to be a bug with character length vs. byte length which results in extra blank lines and the ends of long lines being output a second time. And Python simply crashes. If it did work I would regard it as a useful workaround but not a true solution to outputting Unicode from so-called cross-platform scripting languages. – hippietrail Feb 11 '11 at 10:12
  • Either codepage 65001 isn't exactly the same as Unicode or the default console font for it on the Chinese version of Windows 7 has a few characters that are wrong. – Jeremy List May 14 '15 at 03:46
5

Michael Kaplan has a series of blog posts about the cmd console and Unicode that may be informative (while not really answering your question):

  • [Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?](https://web.archive.org/web/20130101094000/http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx)
  • [Anyone who says the console can't do Unicode isn't as smart as they think they are](https://web.archive.org/web/20130519074717/http://blogs.msdn.com/b/michkap/archive/2010/04/07/9989346.aspx)
  • [A confluence of circumstances leaves a stone unturned...](https://web.archive.org/web/20130620152913/http://blogs.msdn.com/b/michkap/archive/2010/09/23/10066660.aspx)

PS: Thanks @Jeff for finding the archive.org links.

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
  • Michael Kaplan's blog has been disappeared by Microsoft. Here are corresponding archives: - [Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?](https://web.archive.org/web/20130101094000/http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx) - [Anyone who says the console can't do Unicode isn't as smart as they think they are](https://web.archive.org/web/20130519074717/http://blogs.msdn.com/b/michkap/archive/2010/04/07/9989346.aspx) – Jeff Apr 28 '14 at 16:44
  • (continuing, comment was too long) - [A confluence of circumstances leaves a stone unturned...](https://web.archive.org/web/20130620152913/http://blogs.msdn.com/b/michkap/archive/2010/09/23/10066660.aspx) – Jeff Apr 28 '14 at 16:47
4

Are you sure your script would output Unicode correctly on some other platform? The "wide character in print" warning makes me very suspicious.

I recommend looking over this overview.

w.k
  • 8,218
  • 4
  • 32
  • 55
  • 2
    This is actually a valid response. If you get a "wide character in print" warning from Perl, your code is incorrect and broken on all systems. – hobbs Feb 09 '11 at 10:07
  • 1
    Well, if I know I'm printing to a UTF-8 console, as is possible on *nix, I can do "binmode STDOUT, ':utf8'", but on Windows, even though "binmode STDOUT, ':utf16'" doesn't throw any errors, it doesn't work either. Cross-platform code is therefore in a very untenable position unless you have an actual fix to suggest. – hippietrail Feb 11 '11 at 09:49
3

Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?

Because Perl and Python aren't Windows programs. They're Unix programs that happen to have been mostly ported to Windows. As such, they don't like to call Win32 functions unless necessary. For byte-based I/O, it's not necessary; this can be done with the Standard C Library. UTF-16-based I/O is a special case.

Or are the -W APIs inherently broken to such a degree that they can't be used as-is?

I wouldn't say that the -W APIs are inherently broken as much as I'd say that Microsoft's approach to Unicode in C(++) is inherently broken.

No matter how much certain Windows developers insist that programs should use wchar_t instead of char, there are just too many barriers to switching:

  • Platform dependence:
    • The use of UTF-16 wchar_t on Windows and UTF-32 wchar_t elsewhere. (The new char16_t and char32_t types may help.) See the sketch after this list.
    • The non-standardness of UTF-16 filename functions like _wfopen, _wstat, etc. limits the ability to use wchar_t in cross-platform code.
  • Education. Everybody learns C with printf("Hello, world!\n");, not wprintf(L"Hello, world!\n");. The C textbook I used in college never even mentioned wide characters until Appendix A.13.
  • The existing zillions of lines of code that use char* strings.
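To make the first bullet concrete, a tiny illustration whose output differs by platform:

#include <stddef.h>
#include <stdio.h>

int main(void) {
    /* Prints 2 on Windows (UTF-16 code units) and typically 4 on
       Unix-like systems (UTF-32), so wchar_t-based code is not
       portable byte-for-byte. */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}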
dan04
  • 87,747
  • 23
  • 163
  • 198
  • It's certainly evident that Perl and Python are ports from *nix but on Python's own website, www.python.org, they don't play down their Windows support, in fact they list it first! "Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines." (Perl's website is less bold). Perhaps they should be a little more modest and admit that Windows is a second class citizen or make the effort to call iconv() / WideCharToMultiByte() / MultiByteToWideChar() at the edges where text moves between the OS and the interpreter. – hippietrail Feb 13 '11 at 10:37
  • 1
    I have to confess that I always thought `_wfopen` meant something rather more, um, *expletive* in nature. ☺ – tchrist Feb 14 '11 at 15:20
2

For Python, the relevant issue in the tracker is http://bugs.python.org/issue1602 (as said in the comments). Note that it has been open for 7 years. I tried to publish a working solution (based on information in the issue) as a Python package: https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console.

user87690
  • 687
  • 3
  • 25
2

For Perl to fully support Windows in this way, every call to print, printf, say, warn, and die has to be modified.

  • Is this Windows?
  • Which version of Windows? Perl still mostly works on Windows 95
  • Is this going to the console, or somewhere else?

Once you have that determined, you then have to use a completely different set of API functions.
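A rough C-level sketch of what that branching looks like underneath (this is not Perl's actual implementation; the helper name is hypothetical, and it assumes the MSVC runtime):

#include <stdio.h>
#include <wchar.h>
#include <io.h>        /* _isatty, _fileno (Microsoft extensions) */
#include <windows.h>

/* Hypothetical helper: wide output to a real console goes through
   WriteConsoleW; redirected output falls back to UTF-8 bytes. */
static void print_wide(const wchar_t *s) {
    if (_isatty(_fileno(stdout))) {
        DWORD written;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                      s, (DWORD)wcslen(s), &written, NULL);
    } else {
        char buf[512];
        int n = WideCharToMultiByte(CP_UTF8, 0, s, -1,
                                    buf, sizeof buf, NULL, NULL);
        if (n > 0)
            fwrite(buf, 1, (size_t)n - 1, stdout);  /* skip the NUL */
    }
}

int main(void) {
    print_wide(L"caf\x00E9\n");
    return 0;
}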

If you really want to see everything involved in doing this properly, have a look at the source of Win32::Unicode::Console.


On Linux, OpenBSD, FreeBSD and similar OSes you can usually just call binmode on the STDOUT and STDERR file handles.

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

This assumes that the terminal is using the UTF-8 encoding.

Brad Gilbert
  • 33,846
  • 11
  • 78
  • 129
  • Well, in the same way that some people may theoretically run Perl on Windows 95 without full wide-function support, some people may theoretically be running *nix with terminals set to some other encoding, especially Japanese users. In this case just calling binmode won't be enough. I would expect Perl to just call wprintf, and the C library to correctly handle the console, UTF-16, and redirection. If the C libraries are broken then I would release Perl from any blame, of course. – hippietrail Feb 14 '11 at 02:06
0

Unicode issues in Perl

covers how the Win32 console works with Perl and the transcoding that happens behind the scenes from ANSI to Unicode; albeit not just a Perl issue, it affects other languages too.

nikosv
  • 1