I realized that accents in my texts get converted to �. I boiled it down to the following example, which writes (and overwrites) the file test.txt.

It exclusively uses functions from Data.Text, which are supposed to handle Unicode text. I checked that both the source file and the output file are encoded in UTF-8.

{-# LANGUAGE OverloadedStrings #-}

import Prelude hiding (writeFile)
import Data.Text
import Data.Text.IO

someText :: Text
someText = "Université"

main :: IO ()
main = do 
    writeFile "test.txt" someText

After running the code, test.txt contains: Universit�. In GHCi, I get the following:

*Main> someText
"Universit\233"

Is this already encoded incorrectly? I also found a comment on � in https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text.html, but I still do not know how to correct the example above.

How do I use accents in an OverloadedStrings literal and write them to a file correctly?

mna
  • Strings (and Text as well, I believe) in GHCi are printed after escaping "funny" chars: this is done as if the user typed `putStrLn (show string)` where `show` does the escaping and adds quotes. You can print the naked string/text by `putStrLn string` (remember to use `Data.Text.putStrLn` for text, instead of the prelude one) – chi Aug 27 '17 at 12:57
  • putStrLn in ghci shows the accent correctly, so it must be writeFile? – mna Aug 27 '17 at 13:56

1 Answer

This has nothing to do with Data.Text, and certainly not with OverloadedStrings; both handle Unicode just fine.

However, Data.Text.IO will not write a BOM or anything else that indicates the encoding, i.e. the file contains just the encoded text. On a system whose locale is UTF-8, this means it will be in raw UTF-8 form:

sagemuej@sagemuej-X302LA:~$ xxd test.txt 
00000000: 556e 6976 6572 7369 74c3 a9              Universit..
sagemuej@sagemuej-X302LA:~$ cat test.txt 
Université

So depending on what editor you open the file with, it may guess a wrong encoding, and that's apparently your issue. On Linux, UTF-8 has long been the standard, so no issue here, but Windows isn't so up-to-date. It should be possible to manually select the encoding in any editor, though.

In fact, Data.Text.IO.writeFile uses your locale to decide how to encode the file. Everybody should have UTF-8 as their locale nowadays; if you don't, please change that.

To get a BOM in your file and thus preclude such issues, set the handle's encoding to utf8_bom (exported by System.IO), e.g. with hSetEncoding.
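As a minimal sketch of the BOM approach, assuming you open the handle yourself instead of calling writeFile: the helper name writeFileUtf8Bom is made up for illustration, while withFile, hSetEncoding and utf8_bom come from System.IO.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import System.IO
import Data.Text (Text)
import qualified Data.Text.IO as T

someText :: Text
someText = "Université"

-- | Hypothetical helper (not part of the text package): write a Text value
-- as UTF-8 with a leading byte order mark, independent of the locale.
writeFileUtf8Bom :: FilePath -> Text -> IO ()
writeFileUtf8Bom path txt =
  withFile path WriteMode $ \h -> do
    -- Override the locale-derived default encoding for this handle only.
    hSetEncoding h utf8_bom
    T.hPutStr h txt

main :: IO ()
main = writeFileUtf8Bom "test.txt" someText
```

Since the encoding is set on the handle, this should not depend on the locale of the machine the program runs on.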

Regarding the output you see in GHCi: that's the Show instance at work; it escapes any string-like values to the safest conceivable form, i.e. anything that's not ASCII becomes an escape sequence, which for 'é' happens to be '\233'. Again, this is not specific to Text; in fact, you get this even for single characters:

Prelude> 'é'
'\233'
Prelude> putChar '\233'
é

This escaping never happens when you use the direct-IO-output actions for your string types, i.e. putChar, putStr or putStrLn.

Prelude> import qualified Data.Text.IO as Txt
Prelude Txt> Txt.putStrLn "Université"
Université
leftaroundabout
  • I opened test.txt in an editor and set the encoding manually to utf8. Still I do not get the accent. How can I tell what encoding writeFile uses? – mna Aug 27 '17 at 13:53
  • I thought it was always UTF-8, but [as per the documentation](http://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text-IO.html#g:2) it actually uses your locale to decide. I strongly recommend you set your locale to UTF-8 and never need to worry about it again. Alternatively you can of course create a bytestring in any encoding you like (just, nowadays [you should _never_ use anything but UTF-8](http://utf8everywhere.org/), so...). – leftaroundabout Aug 27 '17 at 14:05
  • Indeed, my locale for my stack haskell installation is not set to utf8. I have no idea how to change that, but one can use "setLocaleEncoding utf8" to change it in the code. – mna Aug 27 '17 at 14:47
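
To illustrate the last two comments, here is a minimal sketch that first prints the encoding the program would use by default and then overrides it in code. It assumes getLocaleEncoding, setLocaleEncoding and utf8 from GHC.IO.Encoding (part of base); the file name and text are taken from the question.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)
import Data.Text (Text)
import qualified Data.Text.IO as T

someText :: Text
someText = "Université"

main :: IO ()
main = do
  -- Show which encoding newly opened text handles get by default.
  enc <- getLocaleEncoding
  putStrLn ("Locale encoding before: " ++ show enc)

  -- Override the default for this process; handles opened from here on
  -- (including the one writeFile opens) use UTF-8.
  setLocaleEncoding utf8
  T.writeFile "test.txt" someText
```

Note that setLocaleEncoding only changes the default for handles opened afterwards; handles that are already open, such as stdout, keep their encoding unless you change it with hSetEncoding.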