ByteStrings, Text, and encoding in Haskell

Question

I want to read input text using the IO functionality of Data.Text. My quandary has to do with encoding discovery. That is, if I don't know the encoding of the text beforehand, how is the IO in Data.Text of any use in situations where the encoding of the text being read differs from the system locale setting? Is there an encoding discovery mechanism somewhere in Data.Text?

I know I might get a bunch of responses that say "use Data.ByteString", but wasn't Data.Text created for the purpose of getting away from the use of Data.ByteString for reading text?

Also, if I must use Data.ByteString, does anyone know what happens when octets 0x80 to 0x9f are read? Are they read in as expected like the rest of the input? They are undefined in ISO-8859-1, and Data.ByteString's IO seems to indicate that input is treated as if the source is ISO-8859-1.

*"Is there an encoding discovery mechanism somewhere in Data.Text?"* [No](http://stackoverflow.com/a/90956/1139697). — Zeta, Dec 21 '13 at 08:14
Where did you see something indicating that inputting a ByteString will treat the input as ISO-8859-1? — Ganesh Sittampalam, Dec 21 '13 at 14:00
In the spec of [Data.ByteString](http://hackage.haskell.org/package/bytestring-0.10.4.0/docs/Data-ByteString.html) - also present in its lazy and char8 variants - under the definition of hGetContents — Mike Menzel, Dec 21 '13 at 16:38
Hmm. I don't understand that statement and I wonder if it's mistaken/obsolete. I would expect that opening the file in binary mode would lead to no encoding changes at all. — Ganesh Sittampalam, Dec 21 '13 at 17:00

score 5 · Answer 1 · answered Dec 21 '13 at 14:25

You’ll want to use ByteString to read the raw bytes, and then a function from Data.Text.Encoding, for example:

decodeUtf8' :: ByteString -> Either UnicodeException Text

to actually decode the raw data and handle any encoding errors. There is no predefined mechanism in text for guessing the encoding, but you can try several decodings in turn, or use ICU’s character set detection facilities. Unfortunately, that functionality is not currently available in text-icu, so you’ll need to import it yourself.
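A minimal sketch of the "try to decode multiple times" idea, assuming UTF-8 and ISO-8859-1 are the only candidates you care about. The names `Guess`, `decodeGuess`, and `readTextGuess` are hypothetical helpers made up for this illustration; only `decodeUtf8'`, `decodeLatin1`, and `B.readFile` come from the text and bytestring packages. Note that the Latin-1 fallback never fails, so this is a heuristic, not real detection.

```haskell
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Which decoding was used (hypothetical type for this sketch).
data Guess = GuessUtf8 | GuessLatin1
  deriving (Eq, Show)

-- | Try strict UTF-8 first; if the bytes are not valid UTF-8,
-- fall back to ISO-8859-1, which accepts every byte sequence.
decodeGuess :: B.ByteString -> (Guess, Text)
decodeGuess bytes =
  case decodeUtf8' bytes of
    Right t -> (GuessUtf8, t)
    Left _  -> (GuessLatin1, decodeLatin1 bytes)

-- | Read a file as raw bytes, then decode with the guess above.
readTextGuess :: FilePath -> IO (Guess, Text)
readTextGuess path = decodeGuess <$> B.readFile path
```

The order matters: valid UTF-8 is also valid Latin-1 (as mojibake), so the stricter decoder has to go first. More candidates could be added to the chain in the same way, but a statistical detector such as ICU's would still be more reliable.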

Thanks. I was just wondering if there was something less clunky than that, but I suppose it will have to do. — Mike Menzel, Dec 21 '13 at 16:39

score 3 · Answer 2 · answered Dec 21 '13 at 14:00


If you don't know the encoding in advance, I think using Data.ByteString and reading in binary mode is exactly the right thing to do. You should get the input data exactly as bytes including octets 0x80 to 0x9f.

Data.Text is the right way to represent something with a known encoding, or rather in decoded form, but if you can't do the decoding on read then I don't think it makes sense to use it at that point.

If your code can later learn or guess the encoding appropriately that's the right time to make the switch.
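To illustrate the point about octets 0x80 to 0x9f, here is a small sketch that writes those bytes to a file and reads them back with Data.ByteString. The names `c1Octets` and `roundTrip` are made up for this example, and the file path is whatever you pass in; `B.writeFile` and `B.readFile` work in binary mode, so no decoding or newline translation should take place.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | The octets the question asks about: 0x80 through 0x9f.
c1Octets :: [Word8]
c1Octets = [0x80 .. 0x9f]

-- | Write the octets to the given file and read them back as raw bytes.
-- Hypothetical helper for this sketch; it overwrites the file at 'path'.
roundTrip :: FilePath -> IO [Word8]
roundTrip path = do
  B.writeFile path (B.pack c1Octets)
  B.unpack <$> B.readFile path
```

Comparing the result of `roundTrip` with `c1Octets` is a quick way to check on your own system that the bytes come back unchanged.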

Ganesh Sittampalam


Thanks. I was wondering if there was some way of getting around it, but I guess not. – Mike Menzel Dec 21 '13 at 16:42

I guess what I'm saying is that there's nothing to get around :-) ByteString is the right representation until you know the encoding. – Ganesh Sittampalam Dec 21 '13 at 16:58
