21

Char is the type for Unicode characters in Haskell, and String is simply [Char] (i.e. a list of Char items). Here is some simple code:

main = putStrLn "©" -- Unicode string

This code compiles fine, but I get a runtime exception when I run it in PowerShell.exe or cmd.exe:

app.exe: <stdout>: commitBuffer: invalid argument (invalid character)

Why does this happen? Weirdly enough, when I do the same in C#, I get no exception:

Console.WriteLine("©");

In .NET, chars are Unicode too. PowerShell or cmd prints c instead of ©, but at least I get no exception. How can I get my Haskell executable to run smoothly?

jub0bs
Andrey Bushman
  • It might be that Haskell requires the program to be run in a Unicode shell. – Bartek Banachewicz Dec 23 '14 at 08:34
  • My cmd shell prints `"©"` fine but chokes with the same error on `"ഠഃ അ ഠൃ ൩"`. – chi Dec 23 '14 at 10:14
  • Possibly useful: http://stackoverflow.com/questions/22349139/utf8-output-from-powershell I'm no PowerShell or C# expert, but the fact that some character substitution occurs ("c" instead of "©") when you run your C# program may indicate that PowerShell isn't set to use UTF-8... @chi That Unicode string prints out fine on Mac OS X; I use bash via Terminal, which is set to use UTF-8. – jub0bs Dec 23 '14 at 11:06
  • 2
    @Jubobs Indeed, on linux the terminal is set to UTF-8 as well, and I never had issues there. @Bush If all you want is avoid exceptions, you can use `chcp 65001` in the terminal -- all non ascii characters will be unreadable, though. – chi Dec 23 '14 at 11:54
  • @chi > *and I never had issues there*.
    That's because you didn't use Cyrillic. Many Linux distributions print garbage in the terminal by default instead of Cyrillic characters.
    – Andrey Bushman Dec 23 '14 at 11:58
  • @Bush True, unicode support used to be horrible in the past, and possibly even right now on some distros. I did a quick test using Ubuntu 14.04, and was able to output cyrillic, chinese, arabic, and hebrew text samples (albeit only in left-to-right mode). Emacs also reacted to RTL scripts by correctly reversing the input direction (e.g. Del erases from the left, backspace from the right). The output on the terminal looks fine (as far as I can see) except for Hindi where some combined chars were split (they look fine in Emacs, though). – chi Dec 23 '14 at 12:25
  • @chi, I wrote you about the *terminals*, but not about the *terminal emulators*. Terminal emulator writes cyrillic chars fine. Did you try to do it in the terminal, instead of terminal emulator? – Andrey Bushman Dec 23 '14 at 12:31
  • 1
    @Bush Have you set your code page using [`chcp.com 65001`](http://stackoverflow.com/q/25373116/839246)? – bheklilr Dec 23 '14 at 14:13
  • @bheklilr, thank you! Now it works without exception. – Andrey Bushman Dec 23 '14 at 14:23
  • Oh, @Jubobs wrote about this too, but I didn't see it. – Andrey Bushman Dec 23 '14 at 14:25

2 Answers

9

On Windows, the fix is to tell the shell to use code page 65001 (instructions here), which puts the console in "UTF-8 mode". It's not perfect, but most Unicode characters should be handled much better.
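As a complement to switching the code page, a program can also ask GHC to encode a handle as UTF-8 explicitly rather than relying on the console's default encoding. The following is only a minimal sketch: `forceUtf8` is a hypothetical helper name, and it assumes the console has already been switched with `chcp 65001`; without that, the bytes may still be displayed as garbage even though no exception is raised.

```haskell
import System.IO

-- Hypothetical helper: make a handle encode its text output as UTF-8,
-- regardless of the encoding GHC picked up from the console.
forceUtf8 :: Handle -> IO ()
forceUtf8 h = hSetEncoding h utf8

main :: IO ()
main = do
  -- Assumption: `chcp 65001` was run in the shell beforehand.
  forceUtf8 stdout
  putStrLn "©"
```

Because the encoding is fixed per handle, `stderr` would need the same call if error messages can contain non-ASCII text.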

bheklilr
  • 7
    The other half of the question is "why does a GHC binary crash rather than gracefully fall back to non-Unicode output like .NET binaries seem to?" GHC can [figure out the system locale](http://hackage.haskell.org/package/base-4.7.0.2/docs/GHC-IO-Encoding.html#v:getLocaleEncoding); we should theoretically be able to do the encoding conversion and avoid crashing. I wonder if anybody's looked into that. – Christian Conkle Dec 25 '14 at 01:50
8

I think this should count as a bug in GHC, but there is a workaround. The default encoding for all handles in a GHC program (except those opened in Binary mode) is just the encoding accepted by the console with no error handling. Fortunately you can add error handling with something like this.

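import System.IO

-- Re-apply a handle's current encoding with the //TRANSLIT suffix, so that
-- characters the console cannot represent are substituted instead of
-- raising an "invalid character" exception.
makeSafe :: Handle -> IO ()
makeSafe h = do
  ce' <- hGetEncoding h
  case ce' of
    Nothing -> return ()  -- binary-mode handle: nothing to change
    Just ce -> mkTextEncoding ((takeWhile (/= '/') $ show ce) ++ "//TRANSLIT") >>=
      hSetEncoding h

main :: IO ()
main = do
  mapM_ makeSafe [stdout, stdin, stderr]
  -- The rest of your main function.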
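`//TRANSLIT` is not the only suffix `mkTextEncoding` accepts; `//IGNORE` (drop characters that cannot be encoded) and `//ROUNDTRIP` are documented as well. The variant below is a sketch only, with `withSuffix` and `setErrorMode` as hypothetical names, showing how the suffix could be made a parameter:

```haskell
import System.IO

-- Hypothetical helper: strip any existing "//..." suffix from an encoding
-- name and append the requested one.
withSuffix :: String -> String -> String
withSuffix suffix name = takeWhile (/= '/') name ++ "//" ++ suffix

-- Re-apply the handle's current encoding with the chosen error-handling suffix.
setErrorMode :: String -> Handle -> IO ()
setErrorMode suffix h = do
  menc <- hGetEncoding h
  case menc of
    Nothing  -> return ()  -- binary-mode handle: leave it alone
    Just enc -> mkTextEncoding (withSuffix suffix (show enc)) >>= hSetEncoding h

main :: IO ()
main = do
  -- "IGNORE" silently drops characters the console encoding lacks.
  mapM_ (setErrorMode "IGNORE") [stdout, stderr]
  putStrLn "© First Second, 2014"
```

With `//IGNORE` the unsupported character simply disappears from the output, whereas `//TRANSLIT` substitutes a replacement; which is preferable depends on whether a visible placeholder is wanted.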
Jeremy List
  • Thank you. I get no exception now, but the output is still not what I expected: I get `. ? First Second, 2014` instead of `© First Second, 2014`. – Andrey Bushman Dec 26 '14 at 06:16
  • 1
    It's adding "?" because the encoding used by your console doesn't have the "©" character, but I've never seen it add "" before and I don't know what's going on there. You can also combine this answer with @bheklilr's answer to change your console's encoding to something that has the character you need (codepage 65001 uses the same method as utf-8 for noting character size, but unfortunately it can only be called utf-8 if you don't care what characters are actually displayed) – Jeremy List Dec 26 '14 at 06:30
  • The `.` appears when I load my code into *ghci* and run the `main` function manually. If I compile my code to an exe file, that text does not appear. Thank you. – Andrey Bushman Dec 26 '14 at 06:42