I use the Q42.WinRT library to download an HTML file to the cache. But when I call ReadTextAsync, I get an exception:

No mapping for the Unicode character exists in the target multi-byte code page. (Exception from HRESULT: 0x80070459)

My code is very simple:

var parsedPage = await WebDataCache.GetAsync(new Uri(String.Format("http://someUrl.here")));
var parsedStream = await FileIO.ReadTextAsync(parsedPage);

I opened the downloaded file and it is in ANSI encoding. I think I need to convert it to UTF-8, but I don't know how.

TheX
  • The error doesn't seem to mesh with your observation that it's ANSI (how did you determine that?), but regardless, [ReadTextAsync](http://msdn.microsoft.com/en-us/library/windows/apps/hh701706.aspx) has an overload that allows you to provide a Unicode encoding to match the source file. Perhaps that will get you further? – Jim O'Neil May 20 '13 at 05:38
  • I opened the downloaded file in Notepad++ and it shows ANSI encoding. I tried the overloaded ReadTextAsync and it did not help. – TheX May 20 '13 at 06:07
  • do you have the file/url that we can look at? – Jim O'Neil May 20 '13 at 13:46
  • Yes. Please try to download [link](http://bash.im) – TheX May 20 '13 at 17:39
  • please confirm the file link, that doesn't look right – Jim O'Neil May 20 '13 at 23:23
  • Yes, it's the right link. I can download the HTML file. The WebDataCache.GetAsync() method just downloads the HTML source from the link and puts it in local storage. – TheX May 21 '13 at 20:46

1 Answer

The problem is that the original page is not encoded in Unicode; it's Windows-1251, and ReadTextAsync only handles Unicode encodings (UTF-8 or UTF-16). The way around this is to read the file as binary and then use Encoding.GetEncoding to interpret the bytes with the 1251 code page and produce the string (which is always Unicode).

For example,

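        // Needs: System.Text (Encoding), Windows.Storage (FileIO),
        // Windows.Storage.Streams (DataReader).
        String parsedStream;
        var parsedPage = await WebDataCache.GetAsync(new Uri("http://bash.im"));

        // Read the cached file as raw bytes instead of text.
        var buffer = await FileIO.ReadBufferAsync(parsedPage);
        using (var dr = DataReader.FromBuffer(buffer))
        {
            var bytes1251 = new Byte[buffer.Length];
            dr.ReadBytes(bytes1251);

            // Decode the bytes with the page's actual code page.
            parsedStream = Encoding.GetEncoding("Windows-1251").GetString(bytes1251, 0, bytes1251.Length);
        }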

The challenge is that the stored bytes alone don't tell you what the code page is, so this works here but may not work for other sites. Generally, UTF-8 is what you'll get from the web, but not always. The Content-Type response header of this page names the code page (its charset parameter), but that information isn't stored in the file.
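If you can get hold of the Content-Type header value yourself (for example by making your own request, since the cached file does not carry it), something like the following could pick the encoding instead of hard-coding 1251. This is only a rough sketch: `CharsetHelper` and `FromContentType` are hypothetical names I made up, not part of Q42.WinRT or WinRT, and which charset names `Encoding.GetEncoding` accepts depends on the platform.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper (not part of Q42.WinRT or WinRT): picks an Encoding
    // from a Content-Type value such as "text/html; charset=windows-1251".
    // Returns the supplied fallback when no usable charset is found.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
            return fallback;

        // Parameters follow the media type and are separated by ';'.
        foreach (var part in contentType.Split(';'))
        {
            var pair = part.Trim();
            if (!pair.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                continue;

            // Drop the "charset=" prefix and any surrounding quotes.
            var name = pair.Substring("charset=".Length).Trim().Trim('"');
            try
            {
                return Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown charset name on this platform.
                return fallback;
            }
            catch (NotSupportedException)
            {
                // Known name, but the platform cannot provide the code page.
                return fallback;
            }
        }
        return fallback;
    }
}
```

With that in place, the decoding line from the example would become something like `CharsetHelper.FromContentType(headerValue, Encoding.UTF8).GetString(bytes1251, 0, bytes1251.Length)`, where `headerValue` is whatever Content-Type string you managed to capture. Falling back to UTF-8 is just my assumption about the most likely default.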

Jim O'Neil