I'm trying to read the content of a page and extract some information, but sometimes I get output like: &nbsp;Aur&eacute;lie (Verschuere)

I already do this:

string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;    
    siteContent = client.DownloadString(edtReadFromUrl.Text);
}

This works for UTF-8 characters. Is there a way to get readable text with no HTML in it? That would be even easier.

Edit: This is not a duplicate of the question it was marked against. The other solution returns the same strange characters.

Alexander
user5014677
  • There are already other answers that cover [downloading the HTML of a webpage in C#](http://stackoverflow.com/questions/16642196/get-html-code-from-a-website-c-sharp). To get just the text and not HTML you'd need to look at [HTML Agility Pack](https://htmlagilitypack.codeplex.com/). – Equalsk Nov 12 '15 at 10:34
  • @Equalsk It does the exact same thing: I get the content with &nbsp; and &eacute;. – user5014677 Nov 12 '15 at 10:37
  • Well, &nbsp; and &eacute; are genuine HTML entities. Are you sure they're not simply meant to be in the source code? – Equalsk Nov 12 '15 at 10:39

1 Answer

You could use an HTML parser to extract the text. For instance, with HtmlAgilityPack, you could do this:

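// Requires the HtmlAgilityPack package, plus:
// using System.Net; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    // Download the raw HTML of the page
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
// InnerText returns the text of the body with the markup removed
string text = doc.DocumentNode.Element("html").Element("body").InnerText;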
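
If the extracted text still contains entities such as &nbsp; or &eacute;, they can be decoded as a separate step. The following is a minimal sketch that uses only System.Net.WebUtility.HtmlDecode from the base class library; the class name EntityDecoding and the helper ToReadableText are hypothetical names chosen for illustration, and replacing non-breaking spaces with ordinary spaces is an assumption about what counts as "readable" here.

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Hypothetical helper (not part of the original answer): decodes HTML
    // entities, then turns non-breaking spaces into ordinary spaces and
    // trims leading and trailing whitespace.
    public static string ToReadableText(string text)
    {
        string decoded = WebUtility.HtmlDecode(text);
        return decoded.Replace('\u00A0', ' ').Trim();
    }

    public static void Main()
    {
        // Sample input taken from the question
        Console.WriteLine(ToReadableText("&nbsp;Aur&eacute;lie (Verschuere)"));
    }
}
```

The same helper could be applied to the InnerText value above, for example ToReadableText(text).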
spender