
There are some posts regarding encoding questions and HtmlAgilityPack but this issue wasn't addressed:

Because the website I am trying to parse contains non-ASCII characters such as ä and ü, I tried to set the encoding to Unicode:

using System.Text;
using HtmlAgilityPack;

public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    // Example address: https://www.dslr-forum.de/showthread.php?t=1930368
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        // Encoding.Unicode is UTF-16 (little endian) in .NET
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further parsing fails: the decoded text is not proper HTML (it looks more like Chinese)
    }
}

But now

htmlDoc.DocumentNode.InnerHtml

looks like this:

ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...

If I try to use UTF-8 or iso-8859-1 the symbol is converted to (as well as ä, ö, ü). How can I fix this?

binaryBigInt

2 Answers


Your site is misconfigured; the real encoding is cp1252 (Windows-1252).

The code below should work:

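using System.Net.Http;
using System.Text;

// Download the raw bytes so that no encoding is guessed for us
var client = new HttpClient();
var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
// Decode the bytes explicitly as cp1252, then hand the string to HtmlAgilityPack
var html = Encoding.GetEncoding(1252).GetString(buf);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);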
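
On .NET Core, `Encoding.GetEncoding(1252)` may throw a `NotSupportedException` unless the code-page provider is registered first. The following is only a minimal sketch of that registration; the class and method names (`Cp1252Demo`, `GetWindows1252`, `Decode`) are made up for illustration, and on .NET Core 2.1 it assumes the `System.Text.Encoding.CodePages` NuGet package is referenced.

```csharp
using System;
using System.Text;

public static class Cp1252Demo
{
    // Hypothetical helper: makes code page 1252 available and returns it.
    public static Encoding GetWindows1252()
    {
        // Without this registration, .NET Core only knows a handful of
        // built-in encodings (ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-32).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Hypothetical helper: decodes raw bytes as Windows-1252.
    public static string Decode(byte[] bytes)
    {
        return GetWindows1252().GetString(bytes);
    }
}
```

With this in place, `Cp1252Demo.Decode(buf)` could stand in for `Encoding.GetEncoding(1252).GetString(buf)` above. Bytes such as 0xE4 and 0xFC decode to ä and ü, and byte 0x80 decodes to the euro sign, which ISO-8859-1 would map to a control character instead.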
L.B
  • Thanks for your reply. `Encoding.GetEncoding(1252);` gives me a `System.NotSupportedException`. Do I have to configure something to get this encoding? I am using `.NET Core 2.1` and `Windows 10 64-bit`. Edit: This fixed it: https://stackoverflow.com/questions/37870084/net-core-doesnt-know-about-windows-1252-how-to-fix Thanks a lot! – binaryBigInt Dec 09 '18 at 09:43

Instead of Encoding.Unicode, use:

web.OverrideEncoding = Encoding.GetEncoding("iso-8859-1");

(Tested with your website and German umlauts.)
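
As for why `Encoding.Unicode` produced Chinese-looking text: in .NET it means UTF-16 (little endian), so every two single-byte characters of the page get fused into one 16-bit character. Here is a small sketch that reproduces the effect; `Utf16MisreadDemo` and `DecodeAsciiAsUtf16` are names invented for this example.

```csharp
using System;
using System.Text;

public static class Utf16MisreadDemo
{
    // Encodes the text as single bytes, then (wrongly) decodes those bytes
    // as UTF-16 LE, pairing up every two bytes into one character.
    public static string DecodeAsciiAsUtf16(string asciiText)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(asciiText);
        return Encoding.Unicode.GetString(bytes);
    }
}
```

Calling `Utf16MisreadDemo.DecodeAsciiAsUtf16("<!DOCTYPE html")` returns `ℼ佄呃偙⁅瑨汭`, which matches the start of the garbled output in the question: the bytes for `<` and `!` (0x3C, 0x21) are read together as U+213C, and so on.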

To find the right encoding, check the head section of the target page's HTML. It contains the right hint:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
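
If you would rather read that hint in code than by eye, something along these lines might do. It is only a rough sketch: `CharsetSniffer` and `FindDeclaredCharset` are hypothetical names, the regular expression is a simplification and not a full HTML parser, and it assumes you already have the beginning of the page as a string (for example, decoded as ASCII just for sniffing).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Returns the charset name declared in a meta tag,
    // or null when no declaration is found.
    public static string FindDeclaredCharset(string htmlHead)
    {
        Match m = Regex.Match(
            htmlHead,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The returned name can be passed to `Encoding.GetEncoding(name)` and assigned to `web.OverrideEncoding`. Keep in mind that the declaration is only a hint: browsers treat the `ISO-8859-1` label as Windows-1252, so if some characters still come out wrong, cp1252 as suggested in the other answer may be the better choice.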
Falco Alexander