
I've used a good few programming languages over the years and I'm an armchair linguist and contributor to Wiktionary. I've been making some tools of my own to look up Wiktionary from the command line, but I've run into a surprising problem.

Neither Perl nor Python can output Unicode to the console natively under both *nix and Windows (though there are various workarounds). The main reason is that *nix OSes like their Unicode in UTF-8 and Windows likes its Unicode in UTF-16. But it also seems that Windows makes it very difficult to use wide characters with the console, even though both the console and wprintf are wide-character native.

So the question is: is the situation any better if I look beyond these languages to Java, C#, Scala, etc.? Or are there any scripting languages which started out on Windows and were then ported to *nix?

Here is some ideal pseudocode:

function main()
{
    print( L"hello, 世界" );
}
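
For concreteness, here is roughly the Python (2.x) equivalent, as a sketch; the exact symptom depends on the active console codepage:

# -*- coding: utf-8 -*-
# Works in a *nix terminal with a UTF-8 locale; on the default Windows
# console (codepage 437, 850, ...) it typically dies with a
# UnicodeEncodeError or prints mojibake, because stdout is byte-oriented
# and the active codepage cannot represent the characters.
print(u"hello, 世界")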
hippietrail
  • The ultimate answer is that any language would ultimately need to call `WriteConsoleW` instead of `WriteFile`, breaking an abstraction barrier... so it's not really a language issue, but a library design issue. – user541686 Feb 17 '11 at 08:03
  • I might be inclined to investigate Java, since it was originally aimed at platform independence and the string handling was built around Unicode. Source files are Unicode, so your ideal pseudocode might just compile with a bit of tweaking. – Jimmy Feb 17 '11 at 08:14
  • @Mehdrad: or Microsoft can possibly fix wprintf et al so you can print wide character strings directly without conversion, unless this is a bug in the specification of C's wprintf or POSIX locales or something? Alternatively, the programming languages could add an abstraction layer between their print function and WriteFile/WriteConsoleW or whatever API they rely on. – hippietrail Feb 17 '11 at 08:21
  • @Jimmy: It seems that Java also does a lossy conversion from wide characters to ANSI and back to wide characters for the Windows console, at least in 2009: http://illegalargumentexception.blogspot.com/2009/04/java-unicode-on-windows-command-line.html – hippietrail Feb 17 '11 at 08:31
  • Unicode in the Windows console is hard. Not as hard as the article you link to makes it look, but not easy. And one of the problems is the font support. Even if you get the right incantation for wprintf to work, you will see squares instead of Chinese characters. So it is not so much the programming language as the medium used for output. You might consider some kind of graphical console (like for instance the "Windows PowerShell ISE"). – Mihai Nita May 28 '11 at 09:36
  • @Mihai Nita: Yes, the console font is the other half of the problem. On one hand it is more trivial to understand, but on the other hand there's really no current fix in the case of CJK characters, even when selecting a TrueType font works for many other languages. This is purely Microsoft's fault and only they could fix it. But they can fix the `WriteFile()` UTF-8 bug too if they want to. \-: – hippietrail May 28 '11 at 10:42
  • The basic problem is that the Windows console model is broken. The console, instead of just being a normal file handle, is a special device with a different API that doesn't adapt all that well to being made to look like a normal file handle. For example, some strangeness can be observed if you `SetConsoleOutputCP(CP_UTF8)` and then try to write UTF-8 data to the console in different ways. UTF-8 output works via fputs and maybe some other APIs, but you can't write the bytes individually the way std::cout does. – bames53 Nov 30 '11 at 17:15
  • Well yes @bames53, but UTF-8 is not a fully supported codepage on Windows. Windows believes "Unicode" is a synonym for "UTF-16", but it also is not without issues on the console, though better than UTF-8. – hippietrail Nov 30 '11 at 22:21
  • @hippietrail What I'm trying to show here isn't necessarily about UTF-8. It's just an example of how messed up the console is, because if you write to a file instead of the console you get exactly the right output. The broken console leads to the differences between what is output when you print via fputs vs. std::cout, even though in all cases the exact same bytes are being written. Of course, Windows does have enough UTF-8 support for the console to take it in _some_ cases. The fact that that's not enough for _all_ cases is another way to see the same brokenness. – bames53 Nov 30 '11 at 22:42
  • Yes, it's totally true what you're saying that the Windows console is very broken. You can still target the console, but you have to be very aware of what's wrong with it when you try, and that it will involve extra work, and that that extra work is pretty much in the realm of hacks )-: – hippietrail Nov 30 '11 at 23:11
  • Note: the [`win-unicode-console` package](http://stackoverflow.com/a/30551552/4279) may call `WriteConsoleW()` for you transparently, without modifying your Python script (`print(u"hello, 世界")`); see the sketch just below. – jfs Aug 24 '15 at 00:22
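
A minimal sketch of that `WriteConsoleW()` approach, assuming Windows and assuming stdout is actually attached to a console rather than redirected (this is essentially what win-unicode-console automates):

# -*- coding: utf-8 -*-
# Sketch only: bypass the byte-oriented stdout and hand UTF-16 text
# straight to the console. Windows-only; when stdout is redirected to a
# file or pipe you must fall back to ordinary encoded writes.
import ctypes

def console_print(text):
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.GetStdHandle(-11)          # STD_OUTPUT_HANDLE
    written = ctypes.c_ulong(0)
    # WriteConsoleW takes a count of UTF-16 code units; len() matches that
    # for BMP-only strings like this one.
    kernel32.WriteConsoleW(handle, text, len(text), ctypes.byref(written), None)

console_print(u"hello, 世界\n")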

4 Answers


> Does any language do Unicode and cross-platform properly and fully?

C# supports Unicode very extensively. Its standard library (.NET Framework) also has outstanding support for Unicode. Cross-platform support is reasonable, but not perfect: it's achieved via Mono, and on mobile platforms via Xamarin.

Command-line programs are pretty portable but can get screwed by ancient relics, like SSH terminals that haven't been updated for a decade or more.

> Here is some ideal pseudocode:

C# gets pretty close:

using System;
class Program
{
    static void Main(string[] args)
    {
        // Switch the console's output encoding to UTF-8 so the runtime
        // doesn't transcode to the (lossy) OEM codepage.
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        Console.WriteLine("tést, тест, τεστ, ←↑→↓∏∑√∞①②③④, Bài viết chọn lọc");
    }
}

Screenshot of the output (use Consolas or another font that has all the above characters):

[screenshot: proof]

Of course C# is not a scripting language; it is quite different in its approach to pretty much everything.

Roman Starkov

AFAIK almost all scripting languages started in the Unix world and were then ported to Windows; I don't know of any example of a (scripting) language that started on Windows... One scripting language that seems to do pretty fine with Unicode these days is Ruby.

DarkDust
  • The only scripting language I could think of that started on Windows is Windows PowerShell but unlike Perl and Python it seems much more targeted to scripts than programs, and it's very arcane (-: – hippietrail Feb 17 '11 at 08:27
  • It does seem to have some nifty features, though (like the piping of objects)... but it's *only* available on Windows, so doesn't count :-) – DarkDust Feb 17 '11 at 08:38
  • Actually there is a PowerShell for *nix called Pash, but not being a PowerShell guy I haven't tried it: http://pash.sourceforge.net/ – hippietrail Feb 17 '11 at 09:11
  • Ruby garbles the output of UTF-8 under codepage 65001 due to the Windows `WriteFile` bug. I don't think it supports direct output of UTF-16 at all. – hippietrail Apr 18 '11 at 06:07

Eight and a half years have passed and things are improving.

  • NodeJS was the first language to "just work" out of the box with Unicode in the terminal/console on *nix, Mac, and Windows, without regard to whether the OS prefers UTF-8 or UTF-16.

  • At the time I asked this question, this did not work for Perl, Python, or Ruby. I'm not sure about PHP. But at least the Python developers eventually took the relevant bug report / feature request seriously and put some work into it. Python has now worked with cross-platform in-terminal Unicode for some years (see the sketch after this list).

  • I just started looking at Rust, and thought to check this. I was very pleasantly surprised that they also took this issue seriously and Rust is the first low-level / non-scripting language to just work out of the box cross-platform with Unicode in the console on Mac, Windows, and *nix.
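
To illustrate the Python point above: assuming Python 3.6 or later, which writes to the Windows console through the wide-character API rather than byte-oriented WriteFile, the original pseudocode now works unmodified on all three platforms:

#!/usr/bin/env python3
# Runs unmodified on *nix, Mac, and the Windows console with Python 3.6+;
# no codepage fiddling or WriteConsoleW workaround needed.
def main():
    print("hello, 世界")

if __name__ == "__main__":
    main()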

hippietrail

Perhaps this is one of the workarounds you hinted at, but: you can `chcp 65001` in a 'DOS box' with a non-raster font selected, and view UTF-8 output from scripts (or programs) that run unchanged under Unix or Windows. The price to pay is that .bat/.cmd files won't execute.
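
A rough sketch of doing the same thing programmatically from Python (Windows-only, and subject to the `WriteFile` problems described in the comments below, so treat it as best-effort rather than reliable):

# -*- coding: utf-8 -*-
# Sketch: the programmatic equivalent of `chcp 65001`. Switch the console
# output codepage to UTF-8, then write UTF-8 encoded bytes. As the comments
# below note, some Windows versions mishandle the byte count under
# codepage 65001, so output can still come out garbled or truncated.
import ctypes
import sys

ctypes.windll.kernel32.SetConsoleOutputCP(65001)
out = getattr(sys.stdout, "buffer", sys.stdout)  # byte stream on Python 2 and 3
out.write(u"hello, 世界\n".encode("utf-8"))
out.flush()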

Ekkehard.Horner
  • "chcp 65001" sets the "ANSI" encoding to UTF-8 so you can use WriteConsoleA with a UTF-8 string as well as WriteConsoleW with a UTF-16 string. In practice it seems to be poorly supported. It causes Python to crash and Perl to output artifacts that look like they stem from the difference in the character length and byte length of UTF-8 strings. – hippietrail Feb 17 '11 at 08:25
  • I've investigated this further and there is a bug in Windows's `WriteFile()` API where it returns the number of characters under codepage 65001 instead of the documented number of bytes. This is the cause of the `chcp 65001` not working under Perl, PHP and Ruby on Windows. Python suffers from its own separate bug. – hippietrail Apr 18 '11 at 06:05