As an exercise, I want to write an XML parser (I know there are lots of really good libraries out there, but I want to try it myself). I understand that Data.ByteString.Lazy is probably the best option for any sufficiently large XML file, since in memory a plain String is a list of Unicode code points. My question is: should I use Data.Text.Lazy.Encoding.decodeUtf8With as a pre-processing step, or pass encoding detection straight to the parser?

2 Answers
This is a tricky issue.... The encoding of an XML document is specified in the document itself (in the XML declaration). This obviously leads to a chicken-and-egg problem, described here: What use is the 'encoding' in the XML header?
So, if you want to do things correctly, you first have to figure out how to read the first line of the document (is it one byte per character or two?), read that line, then read the rest of the text using the declared encoding. Luckily, the declaration will only contain characters in the range 32-127, which makes things a bit simpler.
If it were me, and I were doing this as a learning exercise, I would just restrict the document to UTF-8 (the details here are just plumbing).
How to do this is specified in the XML standard itself, although this is a non-normative appendix (i.e. you're allowed to do it another way).
Reproducing the algorithm here would be redundant, so I suggest just following the link above.
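
For illustration, here is a minimal Haskell sketch of that first-bytes sniffing, loosely following Appendix F of the spec. The `Encoding` type and `detectEncoding` function are names made up for this example; it only covers the most common byte patterns (no UTF-32 or EBCDIC), and a real version would still have to parse the encoding declaration itself:

```haskell
import qualified Data.ByteString.Lazy as BL

-- Hypothetical result type for this sketch; a fuller detector would also
-- distinguish the UTF-32 variants, EBCDIC, and other 8-bit encodings.
data Encoding = Utf8 | Utf16LE | Utf16BE | Unknown
  deriving (Show, Eq)

-- Look at the first four bytes: either a byte-order mark, or the byte
-- pattern of the literal characters "<?xm", gives away the encoding family.
detectEncoding :: BL.ByteString -> Encoding
detectEncoding bs = case BL.unpack (BL.take 4 bs) of
  (0xEF:0xBB:0xBF:_)    -> Utf8     -- UTF-8 BOM
  (0xFF:0xFE:_)         -> Utf16LE  -- UTF-16 LE BOM
  (0xFE:0xFF:_)         -> Utf16BE  -- UTF-16 BE BOM
  [0x3C,0x3F,0x78,0x6D] -> Utf8     -- "<?xm", one byte per char: an
                                    -- ASCII-compatible family; parse the
                                    -- declaration to narrow it down
  [0x3C,0x00,0x3F,0x00] -> Utf16LE  -- "<?" with interleaved NULs
  [0x00,0x3C,0x00,0x3F] -> Utf16BE
  _                     -> Unknown  -- no recognisable pattern; decide a
                                    -- fallback elsewhere
```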

- I am aware of this. I just wanted to know whether I'd gain much efficiency by doing the non-normative detection myself or by letting the library do it. – Mike Menzel Dec 18 '13 at 02:18
- Well `decodeUtf8With` isn't going to help you if the input is UTF-16, for example. You can use the algorithm in the standard to decide what the input is and then use one of the `decode___With` functions before running your parser over the decoded `Text`. I think having the encoding handled as a separate step is going to lead to a cleaner core parser. – porges Dec 18 '13 at 02:21
- Thanks. I suppose looking at the first bytes and grabbing the encoding declaration into a [Char] string won't be too costly. As for non-Unicode encodings, I just hope that Data.Text.ICU does not cause the memory use to grow exponentially. – Mike Menzel Dec 18 '13 at 10:36
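
A sketch of the pipeline porges describes, reusing the hypothetical `detectEncoding` helper from the answer above. `decodeUtf8With`, `decodeUtf16LEWith`, `decodeUtf16BEWith`, and `lenientDecode` are real parts of the `text` package; `decodeDocument` and the fallback behaviour are assumptions for illustration:

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding
  (decodeUtf8With, decodeUtf16LEWith, decodeUtf16BEWith)
import Data.Text.Encoding.Error (lenientDecode)

-- Decode the whole document to lazy Text in one separate step, so the
-- core parser only ever sees Text and never raw bytes.
decodeDocument :: BL.ByteString -> Maybe TL.Text
decodeDocument bs = case detectEncoding bs of  -- helper from the sketch above
  Utf8    -> Just (decodeUtf8With    lenientDecode bs)
  Utf16LE -> Just (decodeUtf16LEWith lenientDecode bs)
  Utf16BE -> Just (decodeUtf16BEWith lenientDecode bs)
  Unknown -> Nothing  -- hand off to ICU, or reject the document
-- Note: this leaves any BOM in the decoded Text (it comes out as U+FEFF);
-- a fuller version would strip it before handing the Text to the parser.
```

Keeping decoding as its own phase, as suggested, means the parser itself can be written entirely against `Text`.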