9

I've run into a strange problem when printing Unicode strings to the Windows console*.

Consider this text:

אני רוצה לישון

Intermediary

היא רוצה לישון
אתם, הם
Bye
Hello, world!
test

Assume it's in a file called "file.txt".

When I run "type file.txt"**, it prints out fine. But when the same text is printed from a Perl program, like this:

 use strict;
 use warnings;
 use Encode;
 use 5.014;
 use utf8;
 use autodie;
 use warnings    qw< FATAL  utf8     >;
 use open        qw< :std  :utf8     >;
 use feature     qw< unicode_strings >;
 use warnings 'all';

 binmode STDOUT, ':utf8';   # output should be in UTF-8
 my $word;
 my @array = ( 'אני רוצה לישון', 'Intermediary',
    'היא רוצה לישון', 'אתם, הם', 'Bye','Hello, world!', 'test');
 foreach $word(@array) {
    say $word;
 }

Each Unicode line (Hebrew in this case) is printed correctly, and then its tail shows up again, progressively truncated and partially broken, like this:

E:\My Documents\Technical\Perl>perl "hello unicode.pl"
אני רוצה לישון
לישון
�ן

Intermediary
היא רוצה לישון
לישון
�ן

אתם, הם
�ם

Bye
Hello, world!
test

(I save everything in UTF-8).

This is mighty strange. Any suggestions?

(It's not a "Console2" problem* - the same problem shows up on a "regular" Windows console, only there you don't see the Hebrew glyphs).


* Using "Console" (also called "Console2") - it's a nice little utility which enables working with Unicode with the Windows console - see, for example, here: http://www.hanselman.com/blog/Console2ABetterWindowsCommandPrompt.aspx

** Note: at the console, you have to say, of course:

chcp 65001
sarnold
Helen Craigman
  • Oof. Please just use the four-space code formatting rather than all those tags all over the place.
    – sarnold Feb 21 '12 at 01:08
  • These **are** Unicode strings, represented in UTF-8. Please cancel your -1. – Helen Craigman Feb 21 '12 at 01:11
  • Sorry to waste all the effort escaping code, but in Markdown, all you have to do to format code is indent it with 4 spaces. You can also just press the code button `{}` in the editor toolbar. (@sarnold: Fixed!) – Ry- Feb 21 '12 at 01:12
  • Got it. @sarnold: please explain: "four space code formatting"? – Helen Craigman Feb 21 '12 at 01:13
  • Helen: take a look at how @minitech reformatted your post in the [revisions](http://stackoverflow.com/posts/9370720/revisions) (available on every post via the "edited N minutes ago" link) -- it's far easier to modify in the future, copy-and-paste elsewhere, and uses a neater formatting style. Minitech, many thanks. Again. :) – sarnold Feb 21 '12 at 01:17
  • Just put at least four spaces in front of a line. The markdown parser will think it's something to present in the code style. Generally, I type out my post in a text editor then shift the code bits over one indent level before I paste it in. – brian d foy Feb 21 '12 at 01:18
  • Is your file actually saved as UTF-8? You've told Perl that it is with the `utf8` pragma, but if it's not actually encoded as that, things might get messed up. – brian d foy Feb 21 '12 at 01:20
  • Affirmative. The file **is** actually saved in UTF-8. In addition, "binmode STDOUT, ':utf8';" makes sure it is. – Helen Craigman Feb 21 '12 at 01:23
  • @briandfoy How do you shift the code bits in your text editor? – TLP Feb 21 '12 at 01:31
  • @TLP - depends on code editor. In mine (UltraEdit), you select the text and press "TAB" key or an "Indent" icon. I'm sure Emacs and vi have some easy way as well. – DVK Feb 21 '12 at 04:17
  • @DVK I use vim, and I've been looking for a shortcut to add/remove indent. – TLP Feb 21 '12 at 04:22
  • @DVK Ah, cool, guess I just needed a quick google session. `<` and `>`, how simple. – TLP Feb 21 '12 at 04:25
  • @TLP - if only there was a web site where a person could ask a programming related [question](http://stackoverflow.com/questions/235839/how-do-i-indent-multiple-lines-quickly-in-vi)... :))))) ( the highest voted answer lists `>` command ) – DVK Feb 21 '12 at 04:25
  • @DVK Awesome! That post was pure gold. – TLP Feb 21 '12 at 04:31

4 Answers

5

Did you try the solution from PerlMonks?

It also uses the `:unix` layer to avoid the console buffer.

This is the code from that link:

use Win32::API;

binmode(STDOUT, ":unix:utf8");

#Must set the console code page to UTF8
$SetConsoleOutputCP= new Win32::API( 'kernel32.dll', 'SetConsoleOutputCP', 'N','N' );
$SetConsoleOutputCP->Call(65001);

$line1="\x{2554}".("\x{2550}"x15)."\x{2557}\n";
$line2="\x{2551}".(" "x15)."\x{2551}\n";
$line3="\x{255A}".("\x{2550}"x15)."\x{255D}";
$unicode_string=$line1.$line2.$line3;

print "THIS IS THE CORRECT EXAMPLE OUTPUT IN PURE PERL: \n";
print $unicode_string;
J-16 SDiZ
  • Wow - lots of interesting stuff. I will study it, try it and report back. – Helen Craigman Feb 21 '12 at 01:52
  • She already sets the console to cp 65001, so what's new in that post? – ikegami Feb 21 '12 at 19:06
  • @J-16-SDiZ Thanks guys - it worked beautifully. It still threw the error: `"Global symbol "$SetConsoleOutputCP" requires explicit package name"`, but that has been easily corrected by replacing it with: `Win32::$SetConsoleOutputCP`. – Helen Craigman Feb 21 '12 at 19:50
  • @ikegami: it's not sufficient to chcp 65001. The Windows console buffer and the Perl buffering do not agree, and you must specify: `binmode(STDOUT, ":unix:utf8");` (instead of: `binmode(STDOUT, ":utf8");` ) for Perl Unicode output to Windows console to work. – Helen Craigman Feb 21 '12 at 19:57
  • Correction to my previous comment: it should read `$Win32::SetConsoleOutputCP`, not `Win32::$SetConsoleOutputCP`. – Helen Craigman Feb 21 '12 at 20:00
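
For anyone hitting the same `Global symbol ... requires explicit package name` error mentioned in the comments: that message comes from `use strict`, and one common way to satisfy it is to declare the variable with `my` rather than package-qualifying it. Below is a minimal sketch of the answer's code under `strict`. The `$^O` guard and the `require` are my own additions (assumptions, not part of the PerlMonks code) so that the script still loads on non-Windows systems.

```perl
use strict;
use warnings;

# Bypass Perl's own buffering and emit UTF-8, as in the answer above.
binmode(STDOUT, ':unix:utf8');

# Assumption: only attempt the Win32 call when actually running on Windows.
if ($^O eq 'MSWin32') {
    require Win32::API;
    # "my" gives the variable a lexical declaration, which is what strict asks for.
    my $SetConsoleOutputCP =
        Win32::API->new('kernel32.dll', 'SetConsoleOutputCP', 'N', 'N');
    # Win32::API->new returns undef on failure, so check before calling.
    $SetConsoleOutputCP->Call(65001) if $SetConsoleOutputCP;
}

# Same box-drawing test string as in the answer.
my $line1 = "\x{2554}" . ("\x{2550}" x 15) . "\x{2557}\n";
my $line2 = "\x{2551}" . (" " x 15) . "\x{2551}\n";
my $line3 = "\x{255A}" . ("\x{2550}" x 15) . "\x{255D}";
my $unicode_string = $line1 . $line2 . $line3;

print $unicode_string, "\n";
```

This also swaps the indirect-object form `new Win32::API(...)` for `Win32::API->new(...)`, which is the same call written in the more conventional method syntax.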
3

Continuing to study that PerlMonks post, it turns out there is an even neater and nicer way. Replace:

use Win32::API;

and:

$SetConsoleOutputCP= new Win32::API( 'kernel32.dll', 'SetConsoleOutputCP', 'N','N' );
$SetConsoleOutputCP->Call(65001);

with:

use Win32::Console;

and:

 Win32::Console::OutputCP(65001);

Leave everything else intact.
This is even more in the spirit of Perl's conciseness and magic.
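
Put together with the strings from the question, the whole thing might look roughly like the sketch below. It is trimmed to the pragmas it actually needs, and the `$^O` guard with `require` is my own assumption (not from the PerlMonks post) so the script also runs off Windows. The two lines that matter are the `OutputCP(65001)` call and the `:unix:utf8` layer.

```perl
use strict;
use warnings;
use utf8;             # this source file is saved as UTF-8
use feature 'say';

# Assumption: only load Win32::Console when running on Windows.
if ($^O eq 'MSWin32') {
    require Win32::Console;
    # Same effect as "chcp 65001", but set from inside the script.
    Win32::Console::OutputCP(65001);
}

# ":unix" instead of plain ":utf8", as discussed in the comments above.
binmode(STDOUT, ':unix:utf8');

my @lines = (
    'אני רוצה לישון', 'Intermediary',
    'היא רוצה לישון', 'אתם, הם', 'Bye', 'Hello, world!', 'test',
);

say for @lines;
```

As noted in the comments, the console font still has to contain the glyphs for them to display.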

Helen Craigman
  • In addition, one must change the font of cmd.exe to "Consolas" to be able to see the Unicode characters. – asmaier Jan 15 '17 at 12:26
1

You can also use Win32::Unicode::Console or Win32::Unicode::Native to print Unicode to the Windows console.

bvr
0

This behaviour also does not occur when using ConEmu, which provides proper Unicode support for the Windows command console.

circulosmeos