I'm having some trouble with text encoding. Parsing a website gives me a Data.Text string

"Project - Fran\195\167ois Dubois",

which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a ByteString. The problem is that this yields garbled output:

"Project - FranÃ§ois Dubois".

What am I missing here?

Peter
  • How are you viewing the ByteString output? That will hold the clue to why you are seeing the data this way. – Gangadhar Apr 08 '12 at 05:06
  • I write the ByteString to a file via writeFile. The output looks that way no matter how I open the file (using less or gvim). The file is then converted to a PNG via graphviz, where the garbled output persists. – Peter Apr 08 '12 at 05:51
  • I think the garbling is not happening when you convert to a UTF-8 ByteString. I think the garbling happens when you parse the web page, or perhaps it is in the web page to start with. Could you give us some details of how you're downloading the web page and extracting that value? – dave4420 Apr 08 '12 at 07:46

2 Answers

If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.

That's inconvenient because Data.Text[.Lazy] is supposed to be UTF-16 encoded text, so the two code units 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences C3 83 ([195,131]) and C2 A7 ([194,167]) respectively.
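To see this concretely, the following sketch lists the UTF-8 bytes of the mis-decoded text next to those of the intended text. The names `misdecoded`, `intended` and `utf8Bytes` are made up for illustration; only the library functions come from the text and bytestring packages.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Word (Word8)

-- The name as it appears in the question: the two characters
-- U+00C3 ('Ã') and U+00A7 ('§') where a single 'ç' was meant.
misdecoded :: TL.Text
misdecoded = TL.pack "Fran\195\167ois"

-- The name as it should be: 'ç' is U+00E7 (decimal 231).
intended :: TL.Text
intended = TL.pack "Fran\231ois"

-- UTF-8 encode a Text and list the resulting bytes.
utf8Bytes :: TL.Text -> [Word8]
utf8Bytes = BL.unpack . TLE.encodeUtf8
```

In GHCi, `utf8Bytes misdecoded` contains the four bytes 195,131,194,167 where `utf8Bytes intended` contains only 195,167, which is the double encoding described above.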

The most likely way to get into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or as another 8-bit encoding; 8859-15 is widespread too).

The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].

If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
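As a rough illustration of interpreting the data according to its actual encoding, the sketch below decodes the same raw bytes once as UTF-8 and once as Latin 1. `rawBody` is a hypothetical stand-in for the bytes fetched from the website, and `decodedAsUtf8` and `decodedAsLatin1` are illustrative names; how the bytes are actually obtained depends on the HTTP library in use.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical response body: the UTF-8 bytes of "François".
rawBody :: BL.ByteString
rawBody = BL.pack [70, 114, 97, 110, 195, 167, 111, 105, 115]

-- Decoding with the correct encoding; decodeUtf8' reports invalid
-- input as a Left value instead of throwing an exception.
decodedAsUtf8 :: Either String TL.Text
decodedAsUtf8 = either (Left . show) Right (TLE.decodeUtf8' rawBody)

-- Decoding the same bytes as Latin 1 never fails, but turns each
-- byte into its own character, giving "Fran\195\167ois".
decodedAsLatin1 :: TL.Text
decodedAsLatin1 = TLE.decodeLatin1 rawBody
```

Decoding as UTF-8 yields the single character 'ç', while decoding as Latin 1 reproduces exactly the Text from the question.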


If avoiding the situation is not possible, the easiest ways of fixing it are

  1. replacing the occurrences of the offending sequence with the desired one before encoding:

    encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
    
  2. assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:

    Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
    

The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
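Both fixes can be sketched as small functions. The names `fixOne` and `textCharsAsBytes` are made up for illustration, and the guard in the second one is an addition that is not part of the one-liner above: it returns Nothing when the assumption (no characters above 255) does not hold, instead of silently truncating them.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Fix 1: replace one known mis-decoded sequence, then encode.
fixOne :: TL.Text -> BL.ByteString
fixOne =
  TLE.encodeUtf8
    . TL.replace (TL.pack "Fran\195\167ois") (TL.pack "Fran\231ois")

-- Fix 2: treat every character as a single byte. This is only valid
-- if all characters are at most U+00FF, so that case is checked first.
textCharsAsBytes :: TL.Text -> Maybe BL.ByteString
textCharsAsBytes t
  | TL.all (<= '\255') t = Just (BLC.pack (TL.unpack t))
  | otherwise            = Nothing
```

For the string from the question, both functions produce the same bytes as UTF-8 encoding the correctly decoded text; the result can then be written with Data.ByteString.Lazy.writeFile.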

Daniel Fischer
  • “If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly.” It's unfortunate this sentence is in the last paragraph, because it's the first thing the OP should try. – dave4420 Apr 08 '12 at 11:16
  • You're right. I rearranged the points, but I'm not sure if it's better now. Feel invited to roll back or edit if you know how to improve it. – Daniel Fischer Apr 08 '12 at 11:34
  • Thanks, the website specified an incorrect encoding, but I was able to get around it using your fix. – Peter Apr 08 '12 at 15:28

I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.

And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.

From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.

P.S: Creating an answer since it is easier to create descriptive links

Gangadhar