
I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and RegEx features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work great. But when I implemented the method in my code, I kept getting a KeyNotFoundException. There are no further details, so I'm pretty lost. My code looks like this:

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
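For completeness, one defensive step worth noting (the URL below is a placeholder): WebClient does not assume UTF-8 on its own, so forcing the encoding before downloading rules out mangled bytes as a source of broken entities. A minimal sketch:

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        var client = new WebClient();
        // WebClient defaults to Encoding.Default, not UTF-8;
        // force UTF-8 so the downloaded string is decoded correctly
        client.Encoding = Encoding.UTF8;

        string html = HtmlEntity.DeEntitize(
            client.DownloadString("http://example.com/")); // placeholder URL

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
    }
}
```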

Sebastian Brandes

4 Answers


I understand that the problem is due to the occurrence of non-standard characters — say, for example, Chinese or Japanese text.

Once you find out which characters are causing the problem, you could search for a suitable patch to HtmlAgilityPack.

This may be of some help to you in case you want to modify the HtmlAgilityPack source yourself.
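To narrow down which characters are at fault, a rough sketch (my addition, not part of the original answer; the class and method names are hypothetical) is to probe each entity-like token individually and log the ones that throw:

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class EntityProbe
{
    // Prints every entity-like token in the input that makes
    // HtmlEntity.DeEntitize throw, along with the exception type,
    // so you know exactly what to patch or pre-replace.
    public static void ReportProblematicEntities(string html)
    {
        foreach (Match m in Regex.Matches(html, @"&#?[A-Za-z0-9]+;"))
        {
            try
            {
                HtmlEntity.DeEntitize(m.Value);
            }
            catch (Exception ex)
            {
                Console.WriteLine(m.Value + " -> " + ex.GetType().Name);
            }
        }
    }
}
```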

Alexei - check Codidact

Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:

// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace("&#46;", ".")
        .Replace("&#259;", "ă")
        .Replace("&#226;", "â")
        .ToString();
}

In my case, the string contains both HTML-encoded characters and UTF-8 characters, but the problem is related to some of the encoded characters only.

This is not an elegant solution, but it is a quick fix for text with a limited (and known) set of problematic encoded characters.
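For reference, the intended call order is pre-replace, then de-entitize. A minimal, self-contained sketch (the Demo class and the sample string are mine; the replacement function is a trimmed copy of the one above):

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Demo
{
    // trimmed copy of the replacement function from the answer above
    static string ReplaceProblematicHtmlEntities(string str)
    {
        return new StringBuilder(str)
            .Replace("&#259;", "ă")
            .Replace("&#226;", "â")
            .ToString();
    }

    static void Main()
    {
        string html = "Rom&#226;n&#259; &amp; friends";
        // pre-replace the troublesome numeric entities first, then
        // let DeEntitize handle the rest (the &amp; here)
        string clean = HtmlEntity.DeEntitize(ReplaceProblematicHtmlEntities(html));
        Console.WriteLine(clean);
    }
}
```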

Alexei - check Codidact
  • Out of curiosity I tried these cases with `HttpUtility.HtmlDecode` and it only handled the last case of "&#226;" – Setsu Mar 20 '17 at 21:08
  • @Setsu - I did not try each character. Based on my input text (Romanian language only), I know the set of problematic characters and put them all within the function. However, one should adapt as needed. This is not a decent solution, but it enables HtmlAgillityPack to do its magic afterwards. – Alexei - check Codidact Mar 20 '17 at 21:39
  • Perhaps I'm wrong but I think you mistook what I meant by that comment. `HttpUtility.HtmlDecode` lives in the `System.Web` namespace and is provided by the framework, instead of HtmlAgilityPack. I was just curious to see if it handled those cases. – Setsu Mar 20 '17 at 21:53
  • @Setsu - yes, sorry. You are right. I have tried `HttpUtility.HtmlDecode` and it works only partially. – Alexei - check Codidact Mar 21 '17 at 05:19

My HTML had a block of text like so:

... found in sections: 233.9 & 517.3; ...

Despite the spacing and the decimal point, DeEntitize was interpreting "& 517.3;" as a character entity.

Simply HTML Encoding the raw text fixed the problem for me.

string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc., before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
djs

In my case, I fixed this by updating HtmlAgilityPack to version 1.5.0.
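For anyone on NuGet, the update is a single command in the Package Manager Console (package id as published on nuget.org):

```shell
Install-Package HtmlAgilityPack -Version 1.5.0
```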

rajeemcariazo