
The documentation for `Data.ByteString.hGetContents` says:

As with hGet, the string representation in the file is assumed to be ISO-8859-1.

Why should it have to "assume" anything about the "string representation in the file"? The data is not necessarily strings or encoded text at all. If I wanted something to deal with encoded text I'd use Data.Text or perhaps Data.ByteString.Char8. I thought the whole point of ByteString is that the data is handled as a list of 8-bit bytes, not as text characters. What is the impact of the assumption that it is ISO-8859-1?

massysett
  • [This answer](http://stackoverflow.com/a/2087855/247020) seems to have some helpful background info. – sanityinc Nov 05 '13 at 16:01
  • I suspect this documentation was written back when `ByteString` was intended to be "a faster `String`". Since then we've learned our lesson -- we have actual types that really are faster `String`s, and `ByteString` is only for sequences of bytes. But it wasn't always that way. – Daniel Wagner Nov 05 '13 at 16:38

2 Answers


It's a roundabout way of saying that no decoding is performed. ISO-8859-1 is a single-byte encoding, so there is nothing to decode, and `hGetContents` simply gives you bytes in the range 0x00 to 0xFF:

    $ cat utf-8.txt
    ÇÈÄ
    $ iconv -f iso8859-1 iso8859-1.txt
    ÇÈÄ
    $ ghci
    > import qualified Data.ByteString as BS
    > import System.IO
    > openFile "iso8859-1.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
    [199,200,196,10]
    > openFile "utf-8.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
    [195,135,195,136,195,132,10]
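
The same point can be checked without preparing files by hand. The following is only a minimal sketch: `roundTrip` and the scratch file name `bytes.bin` are hypothetical names made up for illustration, and it assumes the `bytestring` and `directory` packages that ship with GHC. It writes raw bytes to a temporary file, sets the read handle's text encoding to UTF-8, and reads the bytes back with `BS.hGetContents`.

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper (not part of bytestring): write raw bytes to a scratch
-- file, then read them back through a handle whose text encoding is UTF-8.
roundTrip :: [Word8] -> IO [Word8]
roundTrip bytes = do
  tmpDir <- getTemporaryDirectory
  (path, hOut) <- openBinaryTempFile tmpDir "bytes.bin"
  BS.hPut hOut (BS.pack bytes)
  hClose hOut
  hIn <- openFile path ReadMode
  hSetEncoding hIn utf8            -- only affects String-based reads
  contents <- BS.hGetContents hIn  -- strict read; closes the handle when done
  removeFile path
  return (BS.unpack contents)

main :: IO ()
main = do
  -- 199, 200, 196 spell "ÇÈÄ" in ISO-8859-1 but are not valid UTF-8.
  result <- roundTrip [199, 200, 196, 10]
  print result  -- prints [199,200,196,10]
```

Even though the bytes are not valid UTF-8 and the handle is set to UTF-8, they come back unchanged, which is consistent with the handle's encoding playing no part in `ByteString` reads.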
Mikhail Glushenkov

Perhaps it's similar to this, then:

There are situations where encodings are handled incorrectly but things still work. An often-encountered situation is a database that's set to latin-1 and an app that works with UTF-8 (or any other encoding). Pretty much any combination of 1s and 0s is valid in the single-byte latin-1 encoding scheme. If the database receives text from an application that looks like 11100111 10111000 10100111, it'll happily store it, thinking the app meant to store the three latin characters "ç¸§". After all, why not? It then later returns this bit sequence back to the app, which will happily accept it as the UTF-8 sequence for "縧", which it originally stored. The database admin interface automatically figures out that the database is set to latin-1 though and interprets any text as latin-1, so all values look garbled only in the admin interface.
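
The three bytes in that example can be examined from Haskell as well. This is an illustrative sketch only: `sample`, `asLatin1` and `asUtf8` are names invented here, and it assumes the `text` package is available. It decodes the same `ByteString` once as Latin-1 and once as UTF-8 and prints the resulting code points.

```haskell
import qualified Data.ByteString as BS
import Data.Char (ord)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The three bytes from the quoted example: 11100111 10111000 10100111.
sample :: BS.ByteString
sample = BS.pack [0xE7, 0xB8, 0xA7]

-- Code points obtained by reading each byte as one Latin-1 character.
asLatin1 :: BS.ByteString -> [Int]
asLatin1 = map ord . T.unpack . TE.decodeLatin1

-- Code points obtained by decoding the same bytes as UTF-8.
-- decodeUtf8 throws on malformed input; the sample is well-formed.
asUtf8 :: BS.ByteString -> [Int]
asUtf8 = map ord . T.unpack . TE.decodeUtf8

main :: IO ()
main = do
  print (asLatin1 sample)  -- [231,184,167]: three characters
  print (asUtf8 sample)    -- [32295]: one character, U+7E27
  -- Re-encoding the decoded text gives back the original bytes.
  print (TE.encodeUtf8 (TE.decodeUtf8 sample) == sample)  -- True
```

The bytes are identical in both cases; only the interpretation applied afterwards differs, which is why storing and returning them untouched works regardless of what the storage layer believes the encoding to be.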

massysett