
I am trying to build a tee-like utility in Go on Windows, but I found that the encoding of the output is not always the same.
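
For reference, the tee-like shape I am aiming at is roughly this sketch (file handling simplified, errors largely ignored; io.MultiWriter does the fan-out):

package main

import (
    "io"
    "os"
)

func main() {
    // Copy stdin to stdout and to every file named on the command line.
    writers := []io.Writer{os.Stdout}
    for _, name := range os.Args[1:] {
        f, err := os.Create(name)
        if err != nil {
            panic(err)
        }
        defer f.Close()
        writers = append(writers, f)
    }
    io.Copy(io.MultiWriter(writers...), os.Stdin)
}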

To make the problem simple, I wrote this program:

package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    count, err := io.Copy(os.Stdout, os.Stdin)
    fmt.Println(count, err)
}

I named it test. In the Windows command console, I got this output:

>test
中
中
5 <nil>

It works fine with no pipe or redirection.

>echo 中 | test
��
5 <nil>

The output is garbled if stdin comes from a pipe.

>echo 中 | test > test.txt

>type test.txt
中
5 <nil>

It works again when I redirect the output to a file.

>test > test.txt
中

>type test.txt
荳ュ
5 <nil>

But it does not work when I use normal stdin and redirect the output to a file. If I open this test.txt in another editor such as Notepad++, I find it is encoded in UTF-8 and the content is 中.

If I use Cygwin with a UTF-8 encoded console on Windows, everything works fine.

From the output, I know that the number of bytes the program copied is 5, which means the program is dealing with UTF-8 no matter where stdin comes from. But as far as I know, the Windows command console basically uses a non-Unicode encoding, so why is the input converted into UTF-8? And is there a way to make the program copy exactly what stdin sends, without any conversion?

By the way, if I use tee from GnuWin32 to do the same test, everything works fine.

>where tee
D:\Tools\gnuWin32\bin\tee.exe

>echo 中 | tee
中

>tee tee.txt
中
中
^C
>type tee.txt
中

Does anyone know the reason for this, and what the solution is?

Programus
  • The output of your program is the same as its input, no encodings are changing, but the windows console displays it incorrectly due to some weirdness. I forget the details, and I also forget where to find them, but the win32 version of Perl had to deal with the same thing. Windows tries to map console output to utf16 or something like that. http://stackoverflow.com/questions/14109024/how-to-make-unicode-charset-in-cmd-exe-by-default is related however. – hobbs Sep 18 '15 at 03:16
    I think [this](http://stackoverflow.com/questions/3130979/how-to-output-unicode-strings-on-the-windows-console?rq=1) is the real root — programs that use stdio on Windows simply don't get to have non-garbled console output; they have to know that they're printing to the console and use `WriteConsoleW` from the win32 API. – hobbs Sep 18 '15 at 03:21
  • Thank you for your comment, @hobbs. I read the link, but I still cannot understand why it works when I just type into the stdin. If it works with UTF-8, it should also convert what I typed into UTF-8 and output a garbled text instead. And also, I cannot understand why the file is just non-unicode when I used a pipe input while utf-8 encoded when I typed the stdin. – Programus Sep 18 '15 at 03:41
  • You might find [this discussion](https://groups.google.com/d/topic/golang-nuts/our8DRS9gaU/discussion) enlightening -- it deals extensively on Windows, its approach to encoding on the console and Go. – kostix Sep 18 '15 at 17:27
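
To illustrate the approach hobbs describes in the comments above, here is a minimal, untested sketch that checks whether stdout is a real console and, if so, converts the text to UTF-16 and calls the WriteConsoleW API. It assumes the golang.org/x/sys/windows package (which exposes the call as windows.WriteConsole):

package main

import (
    "bufio"
    "fmt"
    "os"
    "unicode/utf16"

    "golang.org/x/sys/windows"
)

func main() {
    out := windows.Handle(os.Stdout.Fd())
    var mode uint32
    // GetConsoleMode only succeeds when the handle is a real console.
    isConsole := windows.GetConsoleMode(out, &mode) == nil

    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        line := in.Text() + "\r\n"
        if isConsole {
            // Convert UTF-8 to UTF-16 and hand it to the console directly.
            u := utf16.Encode([]rune(line))
            var written uint32
            windows.WriteConsole(out, &u[0], uint32(len(u)), &written, nil)
        } else {
            fmt.Print(line) // pipes and files get the plain bytes
        }
    }
}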

1 Answer


It does not use UTF-8. The reason 5 bytes were written is that there is a space (0x20) after 中: echo passes everything before the pipe, so the input is the two bytes of 中 in the console code page, a space, and \r\n, which makes 5 bytes. Without the space there are only 4, as the transcript below shows.
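
The src/main.go in that transcript evidently hex-dumps the input as well as copying it; a rough reconstruction of such a program (my assumption, using encoding/hex.Dump) would be:

package main

import (
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

func main() {
    data, err := io.ReadAll(os.Stdin) // read everything so it can be dumped first
    fmt.Print(hex.Dump(data))         // show the raw input bytes in hex
    os.Stdout.Write(data)             // then copy them to stdout unchanged
    fmt.Println(len(data), err)
}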

C:\Users\jan>echo 中| go run src/main.go
00000000  d6 d0 0d 0a                                       |....|
��
4 <nil>

So on my system the console does not use UTF-8, but GBK.
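
As a quick check, 中 does encode to the two bytes d6 d0 in GBK; a small sketch assuming the golang.org/x/text/encoding/simplifiedchinese package:

package main

import (
    "fmt"

    "golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
    // Encode the UTF-8 string 中 into GBK bytes.
    gbk, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte("中"))
    fmt.Printf("% x %v\n", gbk, err) // expected: d6 d0 <nil>
}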

The bug is that the Windows console cannot change a character that is already on screen, even when an appended byte would turn it into a different one. For example, d6 d0 is 中; once d6 is already on screen as �, appending d0 in a later write does not make the two bytes into one displayed character.

For testing, I wrote a C# console program:

using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // 0xd6 0xd0 is 中 in GBK. Written through a single stream, the two
        // bytes reach the console together and show up as one character.
        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte((byte)'A');
            stdout.WriteByte(0xd6);
            stdout.WriteByte(0xd0);
        }

        // Here the same two bytes are split across separately opened
        // streams, so the console never sees them as one character.
        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte((byte)'B');
            stdout.WriteByte(0xd6);
        }

        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte(0xd0);
        }
    }
}

I got this result:

A中BPress any key to continue . . .

So I guess Windows libc keeps a buffer in front of stdout: it assembles the two bytes into one character and prints that to the console.

The interesting thing I found is that even when the Windows console is in the GBK code page, Go can write UTF-8 to stdout. It seems the bytes written to os.Stdout are not passed directly to the console.

package main

import (
    "fmt"
    "os"
)

func main() {
    // Write the three UTF-8 bytes of 中 in a single call...
    os.Stdout.Write([]byte{0xe4, 0xb8, 0xad})
    // ...then the same three bytes one at a time.
    fmt.Print("\xe4")
    fmt.Print("\xb8")
    fmt.Println("\xad")
}

I got:

C:\Users\jan>go run src/main.go
中中

C:\Users\jan>
Jiang YD
  • Thank you for your answer. You have made the problem clear for me. I think I should convert the output into UTF-8 if I want the console to display the characters correctly. I wonder whether there is a way to have Go just pass the original input through to the output without any conversion; because I am not sure what encoding the input stream will use, it is hard to convert the encoding in the program. – Programus Sep 18 '15 at 05:35
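
One possible route to the raw pass-through asked about above, sketched on the assumption that syscall.Write on Windows goes straight to WriteFile and therefore skips the console conversion that os.Stdout.Write applies (untested):

package main

import (
    "fmt"
    "io"
    "os"
    "syscall"
)

// rawStdout writes bytes with syscall.Write, which should bypass the
// UTF-8 to UTF-16 console handling that os.Stdout.Write performs for
// console handles (assumption; untested).
type rawStdout struct{}

func (rawStdout) Write(p []byte) (int, error) {
    return syscall.Write(syscall.Handle(os.Stdout.Fd()), p)
}

func main() {
    count, err := io.Copy(rawStdout{}, os.Stdin)
    fmt.Println(count, err)
}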