How to properly get the content of a website?

Question

I'm trying to read the content of a page and extract some information. But sometimes I get stuff like: &nbsp;Aur&eacute;lie (Verschuere)

I already do this:

string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;    
    siteContent = client.DownloadString(edtReadFromUrl.Text);
}

It works when there are UTF-8 characters. Can't I get readable text with no HTML in it? That would be even easier.

Edit: This is not the same as the question it was marked a duplicate of. The other solution returns the same strange characters.

There are already other answers that cover [downloading the HTML of a webpage in C#](http://stackoverflow.com/questions/16642196/get-html-code-from-a-website-c-sharp). To get just the text and not HTML you'd need to look at [HTML Agility Pack](https://htmlagilitypack.codeplex.com/). — Equalsk, Nov 12 '15 at 10:34
@Equalsk It does the exact same thing. I get the content with &nbsp; and &eacute; in it. — user5014677, Nov 12 '15 at 10:37
Well, &nbsp; and &eacute; are genuine HTML entities. Are you sure they're not simply meant to be in the source code? — Equalsk, Nov 12 '15 at 10:39

score 0 · Answer 1 · answered Nov 12 '15 at 10:38

You could use an HTML parser to extract the text. For instance, with HtmlAgilityPack, you could do this:

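using System.Net;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
// InnerText gives the text content of <body> without the tags
string text = doc.DocumentNode.Element("html").Element("body").InnerText;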
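
As far as I know, InnerText strips the tags but leaves entities such as &nbsp; and &eacute; undecoded, which would explain the output in the question. Here is a minimal sketch of decoding them afterwards, assuming .NET 4.0 or later, where System.Net.WebUtility.HtmlDecode is available. The class and method names (EntityDecodeDemo, ToPlainText) are made up for illustration, and the sample string is the one from the question with the entities written out in full.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Hypothetical helper: decodes HTML entities and tidies the whitespace.
    public static string ToPlainText(string raw)
    {
        // HtmlDecode converts named and numeric entities to their characters,
        // e.g. &eacute; becomes U+00E9 and &nbsp; becomes U+00A0.
        string decoded = WebUtility.HtmlDecode(raw);

        // Swap non-breaking spaces for ordinary spaces, then trim the ends.
        return decoded.Replace('\u00A0', ' ').Trim();
    }

    public static void Main()
    {
        string raw = "&nbsp;Aur&eacute;lie (Verschuere)";
        Console.WriteLine(ToPlainText(raw));
    }
}
```

You could pass the InnerText value from the snippet above through the same helper. HtmlAgilityPack also ships HtmlEntity.DeEntitize, which should do the decoding step without leaving the library.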
