
I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and RegEx features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work great. But when I implemented the method in my code, I kept getting a KeyNotFoundException. There are no further details, so I'm pretty lost. My code looks like this:

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
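For completeness, one defensive step worth noting (the URL below is a placeholder): WebClient does not assume UTF-8 on its own, so forcing the encoding before downloading rules out mangled bytes as a source of broken entities. A minimal sketch:

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        var client = new WebClient();
        // WebClient defaults to Encoding.Default, not UTF-8;
        // force UTF-8 so the downloaded string is decoded correctly
        client.Encoding = Encoding.UTF8;

        string html = HtmlEntity.DeEntitize(
            client.DownloadString("http://example.com/")); // placeholder URL

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
    }
}
```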

Sebastian Brandes

4 Answers


I understand that the problem is due to the occurrence of non-standard characters — say, for example, Chinese or Japanese text.

Once you find out which characters are causing the problem, you could search for a suitable patch to HtmlAgilityPack.

This may be of some help to you in case you want to modify the HtmlAgilityPack source yourself.
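To narrow down which characters are at fault, a rough sketch (my addition, not part of the original answer; the class and method names are hypothetical) is to probe each entity-like token individually and log the ones that throw:

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class EntityProbe
{
    // Prints every entity-like token in the input that makes
    // HtmlEntity.DeEntitize throw, along with the exception type,
    // so you know exactly what to patch or pre-replace.
    public static void ReportProblematicEntities(string html)
    {
        foreach (Match m in Regex.Matches(html, @"&#?[A-Za-z0-9]+;"))
        {
            try
            {
                HtmlEntity.DeEntitize(m.Value);
            }
            catch (Exception ex)
            {
                Console.WriteLine(m.Value + " -> " + ex.GetType().Name);
            }
        }
    }
}
```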

Alexei - check Codidact

Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:

// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace("&#46;", ".")
        .Replace("&#259;", "ă")
        .Replace("&#226;", "â")
        .ToString();
}

In my case, the string contains both HTML-encoded characters and UTF-8 characters, but the problem is related to some of the encoded characters only.

This is not an elegant solution, but it is a quick fix for text with a limited (and known) set of problematic encoded characters.
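For reference, the intended call order is pre-replace, then de-entitize. A minimal, self-contained sketch (the Demo class and the sample string are mine; the replacement function is a trimmed copy of the one above):

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Demo
{
    // trimmed copy of the replacement function from the answer above
    static string ReplaceProblematicHtmlEntities(string str)
    {
        return new StringBuilder(str)
            .Replace("&#259;", "ă")
            .Replace("&#226;", "â")
            .ToString();
    }

    static void Main()
    {
        string html = "Rom&#226;n&#259; &amp; friends";
        // pre-replace the troublesome numeric entities first, then
        // let DeEntitize handle the rest (the &amp; here)
        string clean = HtmlEntity.DeEntitize(ReplaceProblematicHtmlEntities(html));
        Console.WriteLine(clean);
    }
}
```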

Alexei - check Codidact
  • Out of curiosity I tried these cases with `HttpUtility.HtmlDecode` and it only handled the last case of "&#226;" – Setsu Mar 20 '17 at 21:08
  • @Setsu - I did not try each character. Based on my input text (Romanian language only), I know the set of problematic characters and put them all within the function. However, one should adapt as needed. This is not a decent solution, but it enables HtmlAgillityPack to do its magic afterwards. – Alexei - check Codidact Mar 20 '17 at 21:39
  • Perhaps I'm wrong but I think you mistook what I meant by that comment. `HttpUtility.HtmlDecode` lives in the `System.Web` namespace and is provided by the framework, instead of HtmlAgilityPack. I was just curious to see if it handled those cases. – Setsu Mar 20 '17 at 21:53
  • @Setsu - yes, sorry. You are right. I have tried `HttpUtility.HtmlDecode` and it works only partially. – Alexei - check Codidact Mar 21 '17 at 05:19

My HTML had a block of text like so:

... found in sections: 233.9 & 517.3; ...

Despite the spacing and the decimal point, DeEntitize was interpreting "& 517.3;" as a character entity.

Simply HTML Encoding the raw text fixed the problem for me.

string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc., before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
djs

In my case, I fixed this by updating HtmlAgilityPack to version 1.5.0.
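For anyone on NuGet, the update is a single command in the Package Manager Console (package id as published on nuget.org):

```shell
Install-Package HtmlAgilityPack -Version 1.5.0
```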

rajeemcariazo