How to solve encoding problem reading feed

Question

https://sports.ultraplay.net/sportsxml?clientKey=b4dde172-4e11-43e4-b290-abdeb0ffd711&sportId=1165

I'm trying to read this feed in a .NET environment and I get a BOM error (System.Xml.XmlException: 'There is no Unicode byte order mark. Cannot switch to Unicode.'). How can I solve it? Is it because the XML content doesn't have an XML declaration?

I have tried reading the feed in several ways. Here is one example:

XmlReader reader = XmlReader.Create(feedUrl);
var content = XDocument.Load(reader);

dana · Answer 1 · 2018-11-02T16:32:13.387

2

The XML declaration appears to be what is throwing things off here:

<?xml version="1.0" encoding="utf-16"?>

See: Loading xml with encoding UTF 16 using XDocument

That question covers loading an XML file through a StreamReader. Since you are downloading the document from the web, you can adapt a WebClient to a StreamReader using the OpenRead() method as follows:

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml.Linq;

string feedUrl = "https://sports.ultraplay.net/sportsxml?clientKey=b4dde172-4e11-43e4-b290-abdeb0ffd711&sportId=1165";

XDocument content;
using (WebClient webClient = new WebClient())
using (Stream stream = webClient.OpenRead(feedUrl))
// Decode the bytes as UTF-8 rather than the UTF-16 named in the XML declaration.
using (StreamReader streamReader = new StreamReader(stream, Encoding.UTF8))
{
    content = XDocument.Load(streamReader);
}
Console.WriteLine(content);

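Strangely enough, while the document claims to be UTF-16, the HTTP response says UTF-8, which is why I am specifying that in the StreamReader constructor. Because XDocument.Load then receives already-decoded text from the StreamReader, the parser does not try to switch to UTF-16 based on the declaration.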

HTTP/1.1 200 OK
Date: Fri, 02 Nov 2018 16:28:46 GMT
Content-Type: application/xml; charset=utf-8

This seems to work well :)
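If you would rather not hard-code UTF-8, one option is to take the charset from the response's Content-Type header and fall back to UTF-8 when none is sent. The following is only a sketch under that assumption: the FeedXml class and its method names are hypothetical and not part of any library. It also includes a small check that reproduces the exception from the question on raw bytes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; the names are assumptions.
public static class FeedXml
{
    // Decodes the body with the charset reported by the server and parses
    // the resulting text. Falls back to UTF-8 when no charset is given.
    public static XDocument Load(Stream body, string charset)
    {
        Encoding encoding = string.IsNullOrWhiteSpace(charset)
            ? Encoding.UTF8
            : Encoding.GetEncoding(charset.Trim().Trim('"'));

        using (var reader = new StreamReader(body, encoding))
        {
            // A TextReader hands the parser characters, not bytes, so the
            // encoding named in the XML declaration is not acted upon.
            return XDocument.Load(reader);
        }
    }

    // Returns true when parsing the raw bytes directly throws XmlException,
    // which is what happens to the code in the question.
    public static bool FailsOnRawStream(byte[] bytes)
    {
        try
        {
            XDocument.Load(new MemoryStream(bytes));
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }
}
```

With HttpClient, the charset should be available as response.Content.Headers.ContentType?.CharSet, and you would pass it together with the response stream to FeedXml.Load. Note that Encoding.GetEncoding throws an ArgumentException for a charset name it does not recognise, so real code may want to catch that and fall back.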

edited Nov 02 '18 at 16:32

answered Nov 01 '18 at 20:37

dana


Well, it's UTF8-Encoded because you ask it to be that way. It doesn't mean that the original page encoding was UTF-8 (actually, it's UTF-16). WebClient uses the specified encoding to encode the result data bytes. It doesn't check whether it matches the Response Encoding. If that page was using a different, specific, encoding, you'll get garbled text. Something related I posted: [Kanji characters from WebClient html different from actual Kanji](https://stackoverflow.com/questions/49846392/kanji-characters-from-webclient-html-different-from-actual-kanji-in-website?answertab=active#tab-top) – Jimi Nov 02 '18 at 00:44
If you don't specify an Encoding, the procedure checks the BOM of these: `Encoding.UTF8, Encoding.UTF32, Encoding.Unicode, Encoding.BigEndianUnicode`. Granted, web pages tend to be UTF-8 encoded, but many are not. – Jimi Nov 02 '18 at 00:47
For what it's worth, I received the header in Fiddler. I also tried WebClient with `Encoding.Unicode`, but it barfed. UTF-8 seemed to do the trick. – dana Nov 02 '18 at 01:02
My comment is not specific to this question. It's good to know that the Encoding specified using the `WebClient` property does not guarantee that the downloaded data will be encoded correctly. Quite the opposite. It's probably better not to specify an Encoding. Unless one is sure what that is. In the answer I linked this is reported. Also, the underlying `WebResponse` is used to get the actual Encoding, provided by the remote host. The `byte[]` data is then re-encoded using the correct Encoding. The current question is more related to the `XmlReader` behaviour with a file Encoding. – Jimi Nov 02 '18 at 01:14
The problem I have seen when not specifying an `Encoding` is that `Encoding.Default` will be used. This encodes ASCII characters OK, but screws up a lot of others. So you won't know that you have a bug until you see a few odd characters pop up one day. Per the OP, BOM detection seems to fail in this case, so specifying UTF-8 seems like the right thing to do here. – dana Nov 02 '18 at 02:09
Yes, maybe it wasn't that clear. You could avoid setting the Encoding, leaving the Default, then re-encode when you know the actual encoding used to encode the text of the string you just downloaded. In the example I posted, the text is encoded in Japanese (`System.Text.EUCJPEncoding`). The HTML page is downloaded using the Default encoding, then re-encoded with the correct encoding using a `MemoryStream -> StreamReader`. You could also use the `Encoding.Convert()` method. This ensures (or at least tries to ensure) that the text is encoded as the original. As a note, the real-life code is more complex than that. – Jimi Nov 02 '18 at 02:34
Anyway, as per the OP, this will probably solve the immediate problem (encoding in a Unicode form, to accommodate the `XmlReader` requirements, if that is the tool). If the text is not completely *right*, there's enough info here to understand why and fix it. – Jimi Nov 02 '18 at 02:46
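
For reference, the `MemoryStream -> StreamReader` and `Encoding.Convert()` ideas from the comments above could look roughly like the sketch below. It assumes you already have the raw response bytes (for example from `WebClient.DownloadData`) and know the actual encoding. The `Redecode` class and its method names are hypothetical, not code from the thread.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper illustrating the comments; the names are assumptions.
public static class Redecode
{
    // Decodes raw bytes with a known encoding via MemoryStream -> StreamReader.
    public static string ReadAs(byte[] raw, Encoding actual)
    {
        using (var memory = new MemoryStream(raw))
        using (var reader = new StreamReader(memory, actual))
        {
            return reader.ReadToEnd();
        }
    }

    // Transcodes bytes from one encoding to another with Encoding.Convert.
    public static byte[] Transcode(byte[] raw, Encoding from, Encoding to)
    {
        return Encoding.Convert(from, to, raw);
    }
}
```

For example, `Redecode.ReadAs(raw, Encoding.Unicode)` decodes UTF-16 bytes, while passing the wrong encoding for the same bytes yields garbled text. Working from the raw bytes avoids losing information in an earlier, incorrect decode.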
