Why am I picking up foreign characters and how can I remove them?

Question

I am picking up extra characters (Â) compared to the source when I grab the InnerText of an H3 tag using the HTML Agility Pack.

I am not sure where these characters are coming from or how to remove them.

Extracted String:

Â WeekÂ 1

HTML Source:

<h3>
<span> </span>Week 1</h3>

Current Code:

private void getWeekNumber(string url)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.Load(new System.IO.StringReader(url));

    foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
    {
        MessageBox.Show(h3.InnerText);
    }
}

Current Workaround (taken from somewhere on Stack Overflow; I lost the link):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Method = "GET";

// Holds the downloaded HTML, decoded as UTF-8.
string result;

using (var stream = request.GetResponse().GetResponseStream())
using (var reader = new System.IO.StreamReader(stream, Encoding.UTF8))
{
    result = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

htmlDoc.Load(new System.IO.StringReader(result));

foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
    MessageBox.Show(h3.InnerText);
}

Must make border crossings difficult, picking up foreign characters like that... — corsiKa, Jul 19 '12 at 14:27

score 4 · Accepted Answer · edited May 23 '17 at 12:18

You need to set the encoding when you load the document:

htmlDoc.Load(new System.IO.StringReader(url), Encoding.UTF8);

This tells the Agility Pack that the characters are UTF-8 rather than some other encoding.

The reason you need to do it here is that this is the point at which the text is parsed incorrectly. After this, you are storing literal Â characters.
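To illustrate how a Â can appear at decode time, here is a minimal sketch. It assumes the spaces in the source HTML are non-breaking spaces (U+00A0), which the question does not confirm, and it uses ISO-8859-1 as a stand-in for whatever non-UTF-8 encoding was actually applied. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with another encoding.
    public static string Misdecode(string text, Encoding wrongEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }

    static void Main()
    {
        // Assumption: the page contains U+00A0 (non-breaking space), not U+0020.
        string original = "\u00A0Week\u00A01";

        // In UTF-8, U+00A0 is the two bytes 0xC2 0xA0. Read as ISO-8859-1,
        // 0xC2 becomes "Â" (U+00C2) and 0xA0 becomes a non-breaking space.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string garbled = Misdecode(original, latin1);

        Console.WriteLine(garbled.Contains("\u00C2"));                      // True
        Console.WriteLine(garbled.Length);                                  // 9
        Console.WriteLine(Misdecode(original, Encoding.UTF8) == original);  // True
    }
}
```

Once the bytes have been decoded this way, the string really does contain Â characters, which is why the encoding has to be right at the point where bytes become text.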

The question "Characters in string changed after downloading HTML from the internet" may also be of interest.

answered Jul 19 '12 at 14:57

Chris

The only overloaded method match for HtmlDocument.Load was (string, encoding). I cannot get it to work for StringReader. – deepseapanda Jul 19 '12 at 15:30
Oops, sorry. That was me being careless and copying and pasting blindly... :) – Chris Jul 19 '12 at 16:01
Pointed me in the right direction though, might have it working soon :) – deepseapanda Jul 19 '12 at 16:16
Can you confirm what exactly is in your stringreader (ie in url)? I had assumed it was the url but the documentation (which is pretty rubbish) is very vague. It looks like the stringreader overload is used to pass html in, not a url. And if url is your html then that's not the best way to load it either - there is a LoadHtml method. – Chris Jul 19 '12 at 16:20
Yeah I was using it to load the html, misnamed variable :P I managed to get it working, not sure if it is the best way to go about it though. I'll add the code to my question. – deepseapanda Jul 19 '12 at 18:32
You're going through a few too many hoops, but you've got the right idea that the encoding needs to be applied the moment you get the text. For what it's worth, it looks to me like you might just want to look at the Load method that takes a stream and an encoding, to save you having to go via the stream reader and string reader. – Chris Jul 20 '12 at 08:53
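The last comment's suggestion could look roughly like the sketch below, which passes a stream and an encoding straight to HtmlDocument.Load. It assumes the HtmlAgilityPack package is referenced and that the page really is UTF-8. The class name, the FirstH3Text helper, and the in-memory sample HTML are hypothetical stand-ins; in the asker's code the stream would come from request.GetResponse().GetResponseStream().

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

static class LoadWithEncodingDemo
{
    // Parses HTML from a byte stream as UTF-8 and returns the text of the
    // first <h3>, or null when the document has none.
    public static string FirstH3Text(Stream htmlStream)
    {
        var doc = new HtmlDocument();
        doc.Load(htmlStream, Encoding.UTF8); // decode bytes while loading

        HtmlNode h3 = doc.DocumentNode.SelectSingleNode("//h3");
        return h3 == null ? null : h3.InnerText;
    }

    static void Main()
    {
        // Stand-in for the HTTP response body: UTF-8 bytes with U+00A0 spaces.
        byte[] body = Encoding.UTF8.GetBytes("<h3><span>\u00A0</span>Week\u00A01</h3>");

        using (var stream = new MemoryStream(body))
        {
            Console.WriteLine(FirstH3Text(stream));
        }
    }
}
```

This drops the intermediate StreamReader and StringReader from the workaround, and the null check guards against pages that have no h3 element at all.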

score 1 · Answer 2 · answered Jul 19 '12 at 14:24

It may be your character encoding; try setting the encoding to UTF-8.

I tried it and got the same result, I added the code I used to attempt it above. – deepseapanda Jul 19 '12 at 14:52
